DOCUMENT RESUME



BD 178 5<»8 



XB 007 S39 



AUTHOR          Swezey, Robert W.; And Others
TITLE           Criterion-Referenced Testing: A Discussion of Theory
                and Practice in the Army.
INSTITUTION     Army Research Inst. for the Behavioral and Social
                Sciences, Arlington, Va.
REPORT NO       ARI-RM-75-11
PUB DATE        Dec 75
NOTE            95p.; Appendices marginally legible
EDRS PRICE      MF01/PC04 Plus Postage.
DESCRIPTORS     *Criterion Referenced Tests; Evaluation Needs;
                Mastery Tests; *Military Training; Needs Assessment;
                Research Needs; *Test Construction; *Testing; Testing
                Problems; Use Studies
IDENTIFIERS     *Army

ABSTRACT 

As the basis for developing a criterion referenced
test (CRT) construction manual for the Army and for identifying
potential research areas, a study was conducted which included a
review of the technical and theoretical literature on criterion
referenced testing and a survey of CRT applications at selected Army
installations. It was found that the use of CRT's was limited,
although some serious attempts were being made to develop and
administer them. Progress was noted in such areas as equipment
related skills, but little evidence of CRT development was found in
"soft skill" areas or in team performance situations. There was
general consensus that clearly-written CRT construction guides were
needed. Difficulties were observed in CRT development and use: lack
of task analysis data and well-defined objectives; inattention to
prioritizing tasks; disregard for practical constraints; insufficient
number of items in the item bank for alternate test forms and lack of
item analysis techniques; omission of test reliability and validity
studies; and lack of standardized testing conditions. (A lengthy
bibliography and appendices, including the interview form, summary of
types of individuals interviewed, and quantitative data gathered at
each installation, are provided.) (MH)


CRITERION-REFERENCED TESTING: A DISCUSSION OF THEORY AND PRACTICE IN
THE ARMY



INTRODUCTION 

This report is an interim document dealing with development of a
Criterion-Referenced Test (CRT) Construction Manual. The major objectives
of the study were the development of an easy-to-use, "how-to-do-it"
manual to assist Army test developers in the construction of CRTs, and the
identification of needed research to help achieve a more consistent,
unified criterion-referenced test model.

In order to accomplish these objectives, the study: surveyed the literature
on criterion-referenced testing in order to provide an information
base for development of the CRT Construction Manual; visited selected Army
posts to review the present status of criterion-referenced test construction
and application in the Army; prepared a draft CRT construction manual;
conducted a trial application of the draft manual; and revised the CRT
construction manual. The manual for developing criterion-referenced tests
has been published as an ARI Special Publication, Guidebook for Developing
Criterion-Referenced Tests.

Part 1 of this report reviews the technical and theoretical literature
in criterion-referenced testing. This review is a serious discussion of the
state-of-the-art in criterion-referenced testing, designed for the
academically-oriented reader. The review discusses questions of CRT reliability
and validity in both practical and theoretical areas, different methods of
CRT construction, simulation fidelity (i.e., the extent to which CRTs can
and should mirror real-world performance conditions), the use of CRTs in
mastery learning contexts, test development and item sampling, diagnostic
uses of CRTs, the establishment of passing scores, and uses of CRTs in
public education and military contexts.

Part 2 describes a survey of Army CRT applications at a number of Army 
installations. Results of the survey are indicated through an analysis of 
quantitative data collected during interviews and through a discussion of 
qualitative comments received, problems observed, and areas where changes may 
prove beneficial to the Army.

Appendices A, B, and C provide, respectively, the Interview Protocol used 
during the Army CRT survey; a summary of types of individuals interviewed 
at each Army installation surveyed; and quantitative data gathered at each 
Army post. 



PART 1--REVIEW OF TECHNICAL AND THEORETICAL LITERATURE

Criterion-referenced testing (CRT) has been widely discussed since the
term was popularized by Robert Glaser in 1963. In CRT, questions involving
comparisons among individuals are largely irrelevant. CRT information
is usually used to evaluate the student's mastery of instructional
objectives, or to approximately locate him for future instruction (Glaser
and Nitko, 1971). A CRT has been defined variously in the literature; in
fact, definitions vary so widely that a given test may be classified as
either a CRT or a norm-referenced test (NRT) according to the particular
definition used. Glaser and Nitko (1971) propose a flexible definition:

"A CRT is one that is deliberately constructed so as to
yield measurements that are directly interpretable in
terms of specified performance standards. The performance
standards are usually specified by defining
some domain of tasks that the student should perform.
Representative samples of tasks from this domain are
organized into a test. Measurements are taken and are
used to make a statement about the performance of each
individual relative to that domain."

Common to all definitions is the notion that a well-defined content domain
and the development of procedures for generating appropriate samples of
test items are important. Lyons (1972) argues for the use of
criterion-referenced measurement as a vital part of training quality control:

"Quality control requires absolute rather than relative
criteria. Scores and grades must reflect how many course
objectives have been mastered rather than how a student
compares with other students."

For the purposes of this review, a CRT will be defined as a test where
the score of an individual is interpreted against an external standard
(e.g., a standard other than the distribution of scores of other testees).
Further, CRTs are tests whose items are operational definitions of behavioral
objectives.

The contemporary interest in mastery learning has led to a growing
interest in CRT. CRTs can be used to serve two purposes:

1. They can be used to provide specific information about the
performance levels of individuals on instructional objectives.
This information can be used to support a decision as to
"mastery" of a particular objective (Block, 1971).

2. They can be used to evaluate the effectiveness of instruction.
NRTs given at the end of a course are less useful for making
evaluative decisions about the effectiveness of instruction
because they are not derived from the particular task objectives.
CRT is, however, useful for the evaluation of instruction
because of the specificity of the results to the task objectives
(Lord, 1962; Cronbach, 1965; Shoemaker, 1970a, 1970b; Hambleton,
Rovinelli, and Gorth, 1971).

Popham (1975) points out a basic concern with the instrument itself:

"We have not yet made an acceptable effort to delineate
the defining dimensions of performance tests, in terms
of their content, objectives, post-test nature, background
information level, etc. Almost all of the recently
developed performance tests have been devised more or less
on the basis of experience and instruction."

Ebel (1971) poses a series of arguments against the use of CRT in
education. Ebel points out with some justification that CRT measures do
not tell us all we need to know about educational achievement, noting
that CRT measures are not efficient at discovering relative strengths
and deficiencies. This is true and is an excellent case for combining
CRT with NRT in cases where both relative and absolute information must
be gathered. Ebel also raises an objection shared by many practicing
educators to the whole "systems" approach to educational development.
That is, objectives specific enough to support the generation of CRT are
more likely to suppress than to stimulate "good teaching". Ebel leaves
us, however, without a metric capable of defining "good teaching" and with
the untenable assumption that "good teaching" is the rule. Finally, Ebel
confuses the concept of mastery of material with the practice of using
percentile grades as pass-fail measures. Ebel does not address the notion
that CRTs as currently constructed are the result of the application of a
carefully thought out analysis and development system.



RELIABILITY AND VALIDITY 

As Glaser and Nitko (1971) point out, the appropriate technique for
an empirical estimation of CRT reliability is not clear. Popham and Husek
(1969) suggest the traditional NRT estimates of internal consistency and
stability are not often appropriate because of their dependency on total
test score variability. CRTs typically are interpreted in an absolute
fashion; hence, variability is drastically reduced. CRTs must be internally
consistent and stable, yet estimates of indexes that are dependent on score
variability may not reflect this. This section will critically examine a
number of studies which have addressed the question of reliability. The
question of validity of CRTs is inextricably mingled with the reliability
issue and also presents many facets of opinion and theory. Various
positions concerning reliability and validity will be discussed in turn.

Cox and Vargas (1966) compared the results obtained from two item
analysis procedures using both pre-test and post-test scores; a Difference
Index (DI) was obtained in two ways. A post-test minus pre-test DI was
obtained by subtracting the percentage of students who passed an item on
the pre-test from the percentage who passed on the post-test. Also a DI
was obtained in the more conventional manner. After post-test, the
distribution of scores was divided into the upper third and the lower
third, then the percentage of students in the lower third who passed the
item was subtracted from the percentage of students in the upper third who
passed it. The Spearman rhos obtained between the two DIs were of a
moderate order. The authors concluded that their DI differed sufficiently
from the traditional method to warrant its use with CRTs. Hambleton and
Gorth (1971) replicated the work of Cox and Vargas (1966) and found that
the choice of statistic does indeed have a significant effect on the
selection of test items. The change in item difficulty from pre- to
post-test seems particularly attractive where two test administrations are
possible. Unfortunately, however, this method uses statistical procedures
dependent on score variability, which are questionable for CRT (Popham and
Husek, 1969; Randall, 1972), particularly if it is to be employed for item
selection (Oakland, 1972).
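
The two difference indexes compared by Cox and Vargas can be sketched in a
few lines of Python. This is only an illustration of the verbal description
above: the function names, the 0/1 coding of item scores, and the use of
integer division to form the thirds are assumptions made here, not details
taken from the original study.

```python
def pre_post_di(pre, post):
    """Post-test minus pre-test difference index, in percentage points.

    pre, post: equal-length lists of 0/1 scores on one item (1 = passed).
    """
    if len(pre) != len(post) or not pre:
        raise ValueError("need matching, non-empty pre and post scores")
    return 100.0 * (sum(post) - sum(pre)) / len(pre)


def upper_lower_di(item_post, total_post):
    """Upper-third minus lower-third index computed on the post-test only.

    item_post:  0/1 scores on one item, one per examinee.
    total_post: total post-test scores for the same examinees.
    """
    n = len(total_post)
    k = n // 3  # size of each extreme group (assumed rounding rule)
    if k == 0 or len(item_post) != n:
        raise ValueError("need at least three examinees with matching data")
    order = sorted(range(n), key=lambda i: total_post[i])
    lower, upper = order[:k], order[-k:]

    def pct_passing(group):
        return 100.0 * sum(item_post[i] for i in group) / len(group)

    return pct_passing(upper) - pct_passing(lower)


# Hypothetical data for nine examinees on a single item.
pre = [0, 0, 0, 1, 0, 0, 1, 0, 0]
post = [1, 1, 0, 1, 1, 1, 1, 0, 1]
totals = [12, 15, 5, 18, 14, 16, 19, 6, 13]
print(round(pre_post_di(pre, post), 1))       # about 55.6
print(round(upper_lower_di(post, totals), 1))  # about 66.7
```

Computing both indexes for every item in a pool and rank-correlating the two
lists would reproduce the kind of comparison the authors report.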

Livingston (1972a) acknowledges Popham and Husek's comment that
"the typical indexes of internal consistency are not appropriate for
criterion-referenced tests". Nevertheless, Livingston feels that the
classical theory of true and error scores can be used in determining CRT
reliability. Livingston points out that "when we use criterion-referenced
measures we want to know how far ... [a] score deviates from a fixed
standard." In Livingston's model, each concept based on deviations from a
mean score is replaced by a corresponding concept based on deviations from
the criterion score. In this view, criterion-referenced reliability can be
interpreted as a ratio of mean squared deviations from the criterion score.
If this view is accepted, a number of useful relationships are provided;
for instance, the further a mean score is from the criterion score, the
greater the criterion-referenced reliability of the test for that particular
group. In effect, moving the mean score away from the criterion score has
the same effect on criterion-referenced reliability that increasing the
variance of true scores has on norm-referenced reliability. In other words,
errors of misclassification of the false negative variety can be minimized
by accepting as true masters the group that comfortably exceeds the required
criterion level. Another point is that if we accept Livingston's model,
then the criterion-referenced correlation between two tests depends on the
difficulty level of the tests for the particular group involved. Two tests
can have a high correlation only if each is of similar difficulty for the
group of students. This provides an effective limitation for the computation
of inter-item correlations, as it is often difficult to ensure equal
difficulty levels, which must fluctuate with the group being tested.
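
The dependence of criterion-referenced reliability on the distance between
the group mean and the criterion can be made concrete with a small sketch.
The formula used below is the form in which Livingston's coefficient is
usually quoted in the wider literature; it is not printed in this report, so
it should be read as an assumption, as should the function and parameter
names.

```python
def livingston_k2(reliability, variance, mean, criterion):
    """Criterion-referenced reliability in the form commonly attributed
    to Livingston (assumed here, not quoted from this report):

        k2 = (r * s2 + (m - C)**2) / (s2 + (m - C)**2)

    reliability: conventional (norm-referenced) reliability r
    variance:    observed score variance s2
    mean:        group mean m
    criterion:   criterion (cutting) score C
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    d2 = (mean - criterion) ** 2
    return (reliability * variance + d2) / (variance + d2)


# Same test (r = 0.6, variance = 16), criterion score of 30,
# for groups whose means lie progressively further from the criterion.
for m in (30, 34, 38):
    print(m, round(livingston_k2(0.6, 16.0, m, 30), 2))
# 30 0.6
# 34 0.8
# 38 0.92
```

When the mean equals the criterion the coefficient reduces to the
conventional reliability, and it rises toward 1 as the mean moves away,
which is the relationship described in the paragraph above.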

Regarding Livingston's (1972a) proposal that the psychometric theory of
true and error scores could be adapted to CRT, Oakland (1972) commented
that the procedures seemed viable but that the conditions under which they
could be used were overly restrictive.

Harris (1972) objects to Livingston's (1972a) application of classical
psychometric theory to CRT, pointing out that whether Livingston's coefficient
or a traditional one is applied, the standard error of measurement remains
the same. The fact that Livingston's coefficient is usually the larger does
not mean a more dependable determination of whether or not a true score
falls above or below the criterion score. As a rebuttal, Livingston (1972b)
indicates that Harris overlooked the point that reliability is not a property
of a single score but of a group of scores. Livingston also points out
that the larger criterion-referenced reliability does imply a more dependable
overall determination when this decision is to be made for all
individual scores in the distribution.

Meredith and Sabers (1972) also take issue with Livingston's concept 
of CRT reliability estimation as variability around the criterion score, 
pointing out that CRT is concerned primarily with the accuracy of the 
pass-fail decision and is relatively unconcerned with a person's attainment 
above or below the criterion level.

Roudabush and Green (1972) present an analysis of false positives and
false negatives to derive reliability estimates. These authors presented
several methods for arriving at reliability estimates for CRT. The first
involves ordering items into a hierarchical order of increasing difficulty.
Roudabush and Green propose that error of measurement would be demonstrated
if a student failed an easier item while passing a series of more difficult
items. Oakland (1972) points out that it is exceedingly difficult to
establish the needed hierarchical order. This objection has been raised since
Guttman (1944) first proposed the technique of hierarchical ordering. Roudabush
and Green propose a second technique utilizing point-biserial correlation
between parallel tests. Their results with this method were far from
encouraging. In addition, there is great difficulty inherent in the development
of parallel tests. The third method involves the use of regression equations
to predict item criterion scores but has not yet been fully explored.
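
The first of these methods can be illustrated with a short sketch that looks
for the response pattern the authors treat as evidence of measurement error:
an easier item failed while harder items are passed. Counting every such
(easier failed, harder passed) pair is an assumption made here for
illustration; the report does not say how Roudabush and Green scored these
reversals.

```python
def count_reversals(responses):
    """Count (easier item failed, harder item passed) pairs.

    responses: 0/1 item scores for one examinee, ordered from the
    easiest item to the hardest (1 = passed).
    """
    reversals = 0
    for i, score in enumerate(responses):
        if score == 0:
            # every harder item that was passed contradicts this failure
            reversals += sum(responses[i + 1:])
    return reversals


print(count_reversals([1, 1, 1, 0, 0]))  # 0: consistent with the hierarchy
print(count_reversals([1, 1, 0, 1, 0]))  # 1: one easier item failed
print(count_reversals([1, 0, 1, 1, 0]))  # 2: one failure, two harder passes
```

The sketch presupposes that the difficulty ordering is already known, which
is exactly the step Oakland identifies as hard to establish.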

In a divergent work, Hambleton and Novick (1971) propose regarding CRT
reliability as the consistency of decision-making across parallel forms of
the CRT or across repeated measures. They view validity as the accuracy of
decision-making. This view departs from the classic psychometric view of
reliability and validity, and properly so, as the severely restricted variance
encountered with CRT will cause correlationally-based estimates of reliability
and validity to be artificially low. Hambleton and Novick view a decision
theoretic metric such as a "loss function" as being more appropriate for use
on CRTs. This metric must serve to describe whether an individual's true score
is above or below a cutting score. The concept differs markedly from
Livingston's (1972a) notion, in which the criterion is regarded as the true
score.
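
Reliability as consistency of decision-making can be sketched as the
proportion of examinees who receive the same pass-fail decision on two
parallel forms. The function name, the "score at or above the cut passes"
rule, and the data are hypothetical; the sketch shows the general idea
rather than Hambleton and Novick's own estimator.

```python
def decision_consistency(form_a, form_b, cut):
    """Proportion of examinees classified the same way on both forms.

    form_a, form_b: scores of the same examinees on two parallel forms.
    cut: cutting score; a score >= cut is treated as a pass (assumed rule).
    """
    if len(form_a) != len(form_b) or not form_a:
        raise ValueError("need matching, non-empty score lists")
    same = sum((a >= cut) == (b >= cut) for a, b in zip(form_a, form_b))
    return same / len(form_a)


form_a = [8, 9, 5, 7, 10, 6]
form_b = [9, 8, 6, 6, 10, 8]
print(round(decision_consistency(form_a, form_b, cut=7), 3))  # 0.667
```

Because only the side of the cutting score matters, the index is unaffected
by the restricted score variance that depresses correlational estimates.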

The importance of correct decision-making in CRT applications is also
recognized by Edmonston, Randall, and Oakland (1972), who present a CRT
reliability model aimed at supporting decisions made during formative
evaluation and maximizing the probability of learning an established set
of objectives. Criterion-referenced items are usually binary coded
pass-fail; therefore, summaries of group performance on two items of pre- and
post-test can be displayed in a 2 x 2 contingency table. Edmonston et al.
recommend utilizing the cell proportions to provide information about the
relationships between the variables represented by the table. They find that
a simple summation of the diagonal proportions, $\sum_a p_{aa}$, provides a
very useful measure of agreement between categories, where a indexes the
diagonal cells of the matrix, in which both measures give the same
classification (pass or fail). They also recommend a supplemental measure,
lambda ($\lambda_r$), a variance-free coefficient. Goodman and Kruskal
(1954) define $\lambda_r$ as

$$\lambda_r = \frac{\sum_a p_{aa} - \tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}{1 - \tfrac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}$$

where $p_{M\cdot}$ and $p_{\cdot M}$ are the modal class frequencies for each
of the two cross-classifications. $\lambda_r$ may be interpreted as the
relative reduction in the probability of error of classification when going
from a no-information situation to the other-method-known situation.
Edmonston et al. feel the reliability estimate most useful to CRT is the
extent to which they fluctuate temporally. They feel that, minimally, CRT
items should provide stable estimates of knowledge of curriculum content;
$\sum_a p_{aa}$ and $\lambda_r$ can be used to provide estimates of this
stability. They recommend that $\sum_a p_{aa}$ be used to judge the
test-retest reliability of each item. However, when item re-test
reliability falls below an arbitrary criterion (the value Edmonston et al.
recommend is illegible in this copy) and into a zone of decision,
$\lambda_r$ is employed as a descriptive measure of the amount of
information gained by employing a second item (the re-test) in making
curriculum or placement decisions. If knowledge of the re-test score
provides additional information, the item is retained. However, there
is no current basis for determining the acceptable minimal reduction in
classification error.
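
A minimal sketch of the two statistics for a single item follows. The
agreement measure is the sum of the diagonal proportions of the test-retest
table. For lambda, the code assumes the reliability form of Goodman and
Kruskal's coefficient, (sum of diagonal proportions minus half the sum of the
two modal-class marginals) divided by (1 minus half that sum); that formula
is taken from the general literature rather than from this report, and the
function name and example counts are hypothetical.

```python
def agreement_and_lambda_r(table):
    """Return (sum of diagonal proportions, lambda_r) for a square table.

    table[i][j] = number of examinees with outcome i on the first
    administration and outcome j on the second (e.g., 0 = fail, 1 = pass).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table is empty")
    p = [[count / n for count in row] for row in table]

    p_agree = sum(p[a][a] for a in range(k))

    row_marg = [sum(p[a]) for a in range(k)]
    col_marg = [sum(p[i][a] for i in range(k)) for a in range(k)]
    # modal class: the category with the largest combined marginal
    modal = max(range(k), key=lambda a: row_marg[a] + col_marg[a])
    baseline = 0.5 * (row_marg[modal] + col_marg[modal])
    if baseline >= 1.0:
        raise ValueError("lambda_r is undefined when all cases fall in one class")
    lam = (p_agree - baseline) / (1.0 - baseline)
    return p_agree, lam


# Hypothetical test-retest counts: rows = first test, columns = re-test.
table = [[20, 5],    # failed first: 20 failed again, 5 passed
         [10, 65]]   # passed first: 10 failed re-test, 65 passed again
p_agree, lam = agreement_and_lambda_r(table)
print(round(p_agree, 3), round(lam, 3))  # 0.85 0.455
```

In this example 85 percent of examinees are classified consistently, but
because most examinees pass on both occasions the re-test removes only about
45 percent of the classification error that guessing the modal class would
produce.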

In the same vein as Edmonston et al., Roudabush (1975) views reliability
as referring to the appropriateness of the decisions made that affect the
treatment of the examinee. Roudabush emphasizes "minimizing risk or cost
to examinee." The decision is whether to discontinue instruction,
remediate, or wash out.

As is the case with NRT development, determination of validity for
CRT has seen less investigation than reliability. However, it seems logical
that content validity must be the paramount concern for CRT development.
According to Popham and Husek (1969), content validity is determined by "a
carefully made judgement, based on the test's apparent relevance to the
behaviors legitimately inferable from those delimited by the criterion."

McFann (1975) views the content validation of training as having two
major dimensions. The first dimension is the role of the human within the
general operating system. Generally, this is defined by means of task
analysis. The second dimension involves the skills and knowledge the
trainee brings with him to the course; the training content can then be
viewed as a residual of what must still be imparted to the trainee. The
decision of what to include in the training must also be tempered by
management orientation to cost and effectiveness. Finally, McFann feels that
decisions made on the units or procedures by which output is to be
evaluated have an influence on validation of training content. McFann views the
validation of training content as a dynamic, interactive process whereby
training content is initially determined and then, on the basis of feedback
of student performance on the job, instructional content as well as
instruction method is modified to improve overall system effectiveness.

Edmonston, Randall, and Oakland (1972) hold content validation as
central to CRT development. "CRT items are sampled theoretically from a
large item domain and must be representations of a specified behavioral
objective."

Hambleton and Novick (1971) propose a validity theory in which a new
test Y would serve as criterion. The qualifying score of the second test
need not correspond with the qualifying score of the predictor CRT.
Test Y, these authors suggest, might be derived from performance on the next
unit of instruction, or it may be a job-related performance criterion.
Although this appears to be a good idea, it seems that different conclusions
would be reached if test Y were a job-related criterion instead of performance
on the next unit of instruction. The fact that the conclusions might be
different could, however, yield an approximation of convergent and
divergent validity. Validation of a test by correlating it with
another test may, however, give a distinct overestimate of "validity". This
is particularly true in the case where the tasks on the two tests are
similar.

Edmonston et al. (1972) advocate a method of CRT validation which they
term the criterion-oriented approach, which includes both concurrent and
predictive validity. In order to obtain complete information about an item
and the objective it assesses, the relationship of a CRT to other measures
should be considered (i.e., ratings by teachers and training observers as
well as performance on suitable NRT measures). Edmonston et al. view these
as measures of concurrent validity, although these multiple indicators
could, if properly chosen, provide an estimate of construct validity. In
addressing the problems of predictive validation, Edmonston et al. concur
with Kennedy (date illegible in this copy) in proposing that tests of
curriculum mastery which represent higher order concepts taught within
several curriculum units be used as criteria against which unit test items
would be assessed as to their predictive power. In addition, unit test items
which are more temporally proximate should agree more strongly with Mastery
Test items than items sequenced earlier. This notion has been partially
verified by Edmonston and his co-workers. Final verification of this scheme
of validity determination requires factorially pure items, and this may be a
bit too much to ask of item writers. Edmonston et al. advocate an approach to
construct validity initially put forth by Nunnally (1967). In Nunnally's
view, the measurement and validation of a construct involve the determination
of an internal network among a set of measures, and the consequent formation
of a network of probability statements. This notion is not too far from
Cronbach and Meehl's (1955) enunciation of the need for a "nomological
network" with which to validate a construct. Edmonston et al. indicate that
the "specification of a hierarchy of learning sets among items would seem to
be the ultimate goal of construct validation procedures, enabling the
development of internal and cross structures between items and the consequent
understanding of the interrelationships of all curriculum areas." This
concept would be difficult to implement, as the construction of learning sets
is not an easy procedure. Also, difficulty can be expected in attempting the
establishment of a network of relationships sufficient to completely define a
construct.

In Roudabush's (1975) view of validity, CRT items are designed to sample
as purely as possible the specified domain of behavior, then tried out to
determine primarily if the items are sensitive to instruction. A 2x2
contingency table containing post-test and pre-test outcomes is the basis for
analysis:

                        Post-test
                      -           +
     Pre-test   -     f1          f2
                +     f3          f4
     Total         f1 + f3     f2 + f4

f1 = failed both pre- and post-test
f2 = failed pre-, passed post-
f3 = passed pre-, failed post-
f4 = passed both pre- and post-test



Marks and Noll (1967) assume f3 is due to guessing and derive a sensitivity
index (s) that is simply the proportion of cases that missed the item on the
pre-test and passed it on the post-test, with correction for guessing.

[Equation for s illegible in the source; only the terms f1 and f2 survive.]

Roudabush (1975), however, found that to derive a "reasonably reliable"
value for the index there should be 50 cases who missed the item at pre-test
(f1 + f2), while if the f4 cell is high the index will have little value
(neither will the item). This index ranges from 1.00 to 0.00 but may go
below 0.0 if the item is miskeyed. A problem here may be ensuring that
different but parallel items are used for pre- and post-tests. This problem
is a practical one, but is particularly acute when complex content domains
are contemplated.
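
The pre-test by post-test table and the uncorrected part of the sensitivity
index can be sketched as follows. Here f1 counts examinees who failed both
times, f2 those who failed the pre-test and passed the post-test, f3 those
who passed the pre-test and failed the post-test, and f4 those who passed
both. The guessing correction is deliberately omitted because its formula
cannot be recovered from this copy, so the value computed is only the plain
proportion f2 / (f1 + f2); the function names are hypothetical.

```python
def pre_post_cells(pre, post):
    """Tally the four cells of the pre-test by post-test table.

    pre, post: equal-length lists of 0/1 scores on one item (1 = passed).
    Returns (f1, f2, f3, f4).
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must describe the same examinees")
    f1 = f2 = f3 = f4 = 0
    for a, b in zip(pre, post):
        if not a and not b:
            f1 += 1      # failed both
        elif not a and b:
            f2 += 1      # failed pre-, passed post-
        elif a and not b:
            f3 += 1      # passed pre-, failed post-
        else:
            f4 += 1      # passed both
    return f1, f2, f3, f4


def uncorrected_sensitivity(f1, f2):
    """Proportion of pre-test failures who passed the post-test.

    No guessing correction is applied (assumption: omitted on purpose).
    """
    if f1 + f2 == 0:
        raise ValueError("no examinee missed the item on the pre-test")
    return f2 / (f1 + f2)


pre = [0, 0, 0, 0, 1, 1, 0, 0]
post = [1, 1, 0, 1, 1, 0, 1, 1]
f1, f2, f3, f4 = pre_post_cells(pre, post)
print(f1, f2, f3, f4)                           # 1 5 1 1
print(round(uncorrected_sensitivity(f1, f2), 3))  # 0.833
```

With only six pre-test failures the example falls far short of the 50 cases
Roudabush considers necessary for a reasonably reliable value.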



These various treatments of CRT validity all exhibit difficulties
that often might prove insurmountable to a test constructor dealing with
"real world" problems. Content validity, however, is extremely important
in CRT and can be reasonably ensured by careful attention to objective
development. Construct validity will probably prove elusive, if only due
to the complexity of operations and measures required to demonstrate this
form of validity. Predictive validity appears practicable in many situations.



CONSTRUCTION METHODOLOGY



NRTs are primarily designed to measure individual differences. The 
meaning which can be attached to any particular score depends on a compar- 
ison of that score to a relevant norm distribution. A norm-referenced test 
is constructed specifically to maximize the variability of test scores since
such a test is more likely to produce fewer errors in ordering the individuals 
on the measured ability. Since NRTs are often used for selection and classi- 
fication purposes, it follows that minimizing the number of order errors is 
extremely important. 

NRTs are constructed using traditional item analysis procedures. It
is partly because of this that the test scores cannot be interpreted
relative to some well-defined content domain, since items are normally selected
to produce tests with desired statistical properties (e.g., difficulty levels
around .50), rather than to be representative of a content domain. Likewise,
a wide range of item difficulty does not occur because of resulting variance
restriction. Item homogeneity is also much sought in development of NRTs.
The ultimate purpose is to spread out individuals by maximizing the
discriminating power of each item. The emphasis is on comparing an individual's
response with the responses of others. There is no interest in absolute
measurement of individual skills as in CRTs, only relative comparison.

Although conceptually allied to the construction of NRTs, item analysis 
is an important tool in assembling a test from an item pool and therefore has 
application to the construction of certain CRTs. Although content validity 
is an important characteristic for an item in a CRT, there are other impor- 
tant considerations having to do with the sensitivity and discriminating 
power of an item. These features are important when evaluating instruction
and in ensuring the correct decision regarding an individual's progress 
through instruction. 

In CRT development, the item difficulty index is useful for selecting
"good" items. However, item difficulty is used differently than in NRT.
If the content domain is carefully specified, test items written to measure
accomplishment of the objectives should also be carefully specified and
closely associated with the objectives. Therefore, all of the items
associated with the same objectives should be answered correctly by about
the same proportion of examinees in a group. Items which differ greatly
should be carefully examined to determine if they coincide with the intent
of the objectives.

Similarly, item discrimination indexes can be useful for CRT development.
Negative discrimination indexes warn that CRT items need modification,
or that the instructional process is at fault. A negative index would be
indicative of a high proportion of "false negatives"; conversely, a positive
discrimination index is useful for diagnosing shortcomings in the
instructional program.

An attempt to use item analysis techniques to develop test evaluation
indexes was undertaken by Ivens (1970). Ivens defines reliability indexes
based on the concept of within-subject equivalence of scores. Item reliability
is defined as the proportion of subjects whose item scores are the same on
the post-test and either a re-test or parallel form. Score reliability is
then defined as the average item reliability. Unfortunately, the need for
a re-test or for two (parallel) forms would seem to reduce the usefulness of
this scheme except in very special situations.
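
Ivens' two indexes follow directly from their verbal definitions, as the
sketch below shows. The function names and the layout of the data (one list
of 0/1 scores per item) are assumptions made for illustration.

```python
def item_reliability(post, retest):
    """Proportion of subjects with the same score on an item at
    post-test and at re-test (or on a parallel form)."""
    if len(post) != len(retest) or not post:
        raise ValueError("need matching, non-empty score lists")
    return sum(a == b for a, b in zip(post, retest)) / len(post)


def score_reliability(post_items, retest_items):
    """Average item reliability over all items.

    post_items, retest_items: lists of per-item score lists, in the
    same item order and the same subject order.
    """
    if len(post_items) != len(retest_items) or not post_items:
        raise ValueError("need the same, non-empty set of items on both occasions")
    values = [item_reliability(p, r) for p, r in zip(post_items, retest_items)]
    return sum(values) / len(values)


# Two items, four subjects (hypothetical data).
post_items = [[1, 1, 0, 1], [1, 0, 0, 1]]
retest_items = [[1, 1, 1, 1], [1, 0, 0, 1]]
print(item_reliability(post_items[0], retest_items[0]))  # 0.75
print(score_reliability(post_items, retest_items))       # 0.875
```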

Rahmlow, Matthews and Jung (1970) suggest that the function of a
discrimination index in a CRT is primarily that of indicating the homoge- 
neity of the item with respect to the specific instructional objective 
measured. These authors focus attention on a shift in item difficulty from 
pre-instruction to post-instruction. 

Helmstadter (1972) compared alternative indexes of item usefulness:

1. Item discrimination based on high and low groups on a
post-instructional measure.

2. Shift in item difficulty from pre- to post-instruction.

3. Item discrimination based on pre- and post-test performance.

Shift in item difficulty from pre- to post-instruction produced results
significantly more similar to the pre-post discrimination index than did
the high-low group post-test discrimination index.

Helmstadter also sought to compare the traditional item discrimination
index applied to pre- and post-instruction with difficulty indexes derived
in the same fashion. His findings confirmed that caution should be observed
in the use of traditional item analysis procedures in CRT. In a similar
finding, Roudabush (1975) showed that use of traditional item statistics
would have resulted in some objectives being over-represented while others
would be represented by no items.

Ozenne (1971) has developed an elaborate model of subject response
which he uses to derive an index of sensitivity. In this formulation the
sensitivity of a group of comparable measures given to a sample of Ss
before and after instruction is the variance due to the instructional
effect divided by the sum of the variance due to the instructional effect
and error variance. The index was, however, developed for a severely
restricted sample to allow an analysis of variance treatment. Further
development is indicated before the technique has general usefulness for
sensitivity measurement or item selection.
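
Read literally, the verbal definition of this sensitivity index corresponds
to a simple variance ratio. The symbols below are chosen here for
illustration and are not necessarily the author's own notation:

```latex
% Assumed notation: S = sensitivity index,
% \sigma^2_I = variance due to the instructional effect,
% \sigma^2_e = error variance.
S = \frac{\sigma^2_{I}}{\sigma^2_{I} + \sigma^2_{e}}
```

Under this reading S lies between 0 and 1, approaching 1 when nearly all
observed variance is attributable to instruction and 0 when error variance
dominates.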

New procedures have been developed for item analysis for specific
cases of CRTs but evidence as to their generalizability is lacking. If
item analytic procedures are to be used in evaluating CRTs, then it must
be known what sort of score is produced by that item. The usual score is
a pass-fail dichotomy. A CRT item can result in two types of incorrect
decisions. Roudabush and Green (1972) refer to these errors as "false
positives" and "false negatives". In this view, reliability is concerned
with the CRT's ability to consistently make the same decision. Consequently,
validity becomes the ability of the CRT to make the "right" decision, i.e.,
avoiding false negatives and false positives. The adequacy of a CRT in these
authors' view is determined by its ability to discriminate consistently and
appropriately over a large number of items.
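
The decision-oriented view of reliability and validity can be illustrated
with a minimal sketch. It assumes that pass-fail decisions from two
administrations and an independent judgment of each trainee's true status
are available; all names and data below are hypothetical.

```python
def decision_consistency(first, second):
    """Proportion of trainees receiving the same pass-fail decision twice."""
    same = sum(1 for a, b in zip(first, second) if a == b)
    return same / len(first)


def decision_errors(decisions, true_status):
    """Rates of false positives (passed, but not a master) and
    false negatives (failed, but a master)."""
    n = len(decisions)
    fp = sum(1 for d, t in zip(decisions, true_status) if d and not t)
    fn = sum(1 for d, t in zip(decisions, true_status) if not d and t)
    return {"false_positive": fp / n, "false_negative": fn / n}


# Hypothetical data: 1 = pass (or master), 0 = fail (or non-master).
form_a = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]
form_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
status = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

consistency = decision_consistency(form_a, form_b)
errors = decision_errors(form_a, status)
print(consistency, errors)
```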

Carver (1970) proposed two procedures to assess reliability of a CRT
item. For a single form he suggests comparing the percentage meeting
criterion level in one group to the same percentage in another "similar"
group; for homogeneous sets he recommends using one group and comparing the
percentages identified as meeting criterion on all items. Meredith and Sabers
(1972) point out, however, that it must be determined how two CRT items,
whether identical or parallel, identify the same individual with regard to
his attainment of criterion level. With regard to item analysis procedures,
if a CRT item is administered before and after instruction, and it does not
discriminate, there are alternatives to labeling it unreliable. A
non-discriminating item may simply be an invalid measure of the objectives
or it may indicate that the instruction itself is inadequate or unnecessary.

Meredith and Sabers suggest the use of a matrix consisting of the
pass-fail decisions of two CRTs. By defining the two CRT items as being the
same measures we can examine test/retest reliability, but without time
intervening between the measures, the reliability is of the concurrent or
internal consistency variety. In addition, undefined problems exist with
acceptably defining two CRTs as the same. Various other indexes are possible
but a great weight is placed upon carefully defining relationships between
measures a priori. Considerable confusion is evident in the use of "same"
and parallel forms without formal definitions. Similarly it is stated that
if one CRT item is a "criterion measure", then the validity of the other CRT
can be found. By definition, both are criterion measures and if the
"criterion measure" is external to the instructional domain, then it is not
a CRT item in the same sense. Various coefficients are given but the
difficulty in definition mentioned above limits their usefulness.

FIDELITY



Frederiksen (1962) has proposed a hierarchical model for describing
levels of fidelity in performance evaluation. Frederiksen has identified
six categories:

1. Solicit opinions. This category, the lowest level, may in fact
often miss the payoff questions (e.g., to what extent has the behavior
of trainees been modified as a function of the instructional process).

2. Administer attitude scales. This technique, although
psychometrically refined via the work of Thurstone, Likert, Guttman, and
others, assesses primarily a psychological concept (attitude) which can
only be presumed to be concomitant with performance.

3. Measure knowledge. This is the most commonly used method of
assessing achievement. This technique is usually considered adequate only
if the training objective is to produce knowledge or if highly defined,
fixed procedure tasks are involved.

4. Elicit related behavior. This approach is often used in situations
where practicality dictates observation of behavior thought to be logically
related to the criterion behavior.

5. Elicit "What Would I Do" behavior. This method involves
presentation of brief descriptions or scenarios of problem situations under
simulated predesigned conditions; the subject is required to indicate how
he would solve the problem if he were in the situation.

6. Elicit lifelike behavior. Assessment under conditions which approach
the realism of the real situation.

Measurement at any of the six levels proposed by Frederiksen possesses
both advantages and disadvantages. An optimal solution would be to assess
individual performance at the highest possible level of fidelity.
Unfortunately, deriving performance data may involve a subjective (rating)
technique for a specific situation, requiring a subjectivity vs. fidelity
tradeoff. In order to minimize subjectivity, it may be necessary to decrease
the level of fidelity so that more objective measurements (such as time and
errors) can be obtained. These measures can be conceptualized as surrogates
that in some sense embody real criteria but have the virtue of measurability
(Rapp, Root, and Sumner, 1970). An actual increase in overall criterion
adequacy may result from a gain in objectivity which may compensate for a
corresponding loss in fidelity.

The question of fidelity addresses the issue of how much the test should
resemble the actual performance. Fidelity is not usually at issue in NRT and
has its primary application in criterion-referenced performance tests. There
are trades to be made between fidelity and cost. A more salient issue,
however, is how to empirically modify face fidelity to satisfy needs of the
testing situation while retaining the essential stimuli and demand
characteristics of the real performance situation.



Osborn (1970) addresses problems in finding efficient alternatives to
work sample tests. Osborn was concerned with developing a methodology that
would allow derivation of cheaper procedures that would preserve content
validity. There are many realistic situations where job sample tests are
not feasible, and job-knowledge tests are not relevant. Obviously the
existence of intermediate measures would be a great boon to evaluating
performance in this situation. However, methods for developing
intermediate or "synthetic" measures are lacking. Osborn gives a brief
outline of a method for developing these synthetic measures. Osborn presents
a two-way matrix defined by methods of testing terminal performance (simple
to complex) and component (enabling) behaviors. This matrix serves as a
decision-making aid by allowing the test constructor to choose the test
method most cost-effective for each behavior. The tradeoff that must be made
between test relevance, related diagnostic performance data, and ease of
administration and cost is obvious, and must be resolved by the judgement of
the test constructor. Osborn's notions are intriguing but much more
development is needed before a workable method for deriving synthetic
performance tests is available.
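
One way to picture the use of such a matrix as a decision aid is sketched
below. This is not Osborn's procedure: the behaviors, the test methods, the
relevance ratings, the cost figures, and the rule of taking the cheapest
method that meets a minimum relevance are all assumptions made for
illustration.

```python
# Hypothetical relative cost of each test method.
METHOD_COST = {"job knowledge test": 1, "simulation": 3, "job sample": 8}

# Hypothetical judged relevance (0 to 1) of each method for each behavior.
RELEVANCE = {
    "identify fault symptoms": {
        "job knowledge test": 0.8, "simulation": 0.9, "job sample": 1.0},
    "replace component": {
        "job knowledge test": 0.3, "simulation": 0.7, "job sample": 1.0},
    "align assembly": {
        "job knowledge test": 0.2, "simulation": 0.5, "job sample": 1.0},
}


def choose_method(behavior, min_relevance=0.7):
    """Return the cheapest method whose judged relevance is acceptable."""
    acceptable = [method for method, rating in RELEVANCE[behavior].items()
                  if rating >= min_relevance]
    if not acceptable:
        raise ValueError("no method meets the relevance requirement")
    return min(acceptable, key=METHOD_COST.get)


plan = {behavior: choose_method(behavior) for behavior in RELEVANCE}
print(plan)
```

The ratings themselves remain a matter of judgement by the test
constructor; the sketch only makes the resulting choice explicit.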

Vineberg and Taylor (1972) address a topic allied to the fidelity issue,
that is: to what extent can job knowledge tests be substituted for
performance tests? Practical considerations have often dictated the use of
paper and pencil job knowledge tests because they are simple and economical
to administer and easy to score. However, the use of paper and pencil tests
to provide indexes of individual performance is often considered to be poor
practice by testing "experts". HumRRO research under Work Unit UTILITY
compared the proficiency of Army men at different ability levels and with
different amounts of job experience. This work provided Vineberg and
Taylor with an opportunity to examine the relationship between job sample
test scores and job knowledge test scores in four U.S. Army jobs that
varied greatly in job type and task complexity. Vineberg and Taylor found
that job knowledge tests are valid for measuring proficiency in jobs where:
1) skill components are minimal, and 2) job knowledge tests are carefully
constructed to measure only that information that is directly relevant to
performing the job at hand. Given the high costs of obtaining performance
data, these findings indicate that job knowledge tests are appropriate where
skill requirements are determined by careful job analysis to be minimal.

In a similar work, Engel and Rehder (1970) compared peer ratings, a
job knowledge test, and a work-sample test. These workers found that while
the knowledge test was acceptably reliable, it lacked validity, and reading
ability tended to enter into performance. Peer ratings were judged to
have unacceptable validity. Ratings were also essentially uncorrelated
with the written test. The troubleshooting items on the written test
exhibited a moderate but useful level of validity, while the
corrective-action items had little validity. Finally, Engel and Rehder note
that the work-sample is the most costly method and is difficult to
administer, while the peer ratings and written tests were the least costly
and were easy to administer.



Osborn (1975) discusses an important topic related to both the validity
and fidelity of a CRT. Osborn points out that task outcomes and products
are used to assess student performance while measures of how the tasks are
done (processes) pertain to the diagnosis of instructional systems. Time
or cost factors sometimes preclude the use of product measures, thus leaving
process measures as the only available criteria. There are cases where this
focus on process is legitimate and useful but many where it is not. Osborn
developed three classes of tasks to illustrate what the relative roles of
product and process measurement should be.

1. Tasks where the product is the process.

2. Tasks in which the product always follows from the process.

3. Tasks in which the product may follow from the process.

Relatively few tasks are of the first type. Osborn offers gymnastic
exercises or springboard diving as examples. More tasks are of the second
type, i.e., fixed procedure tasks. In these tasks, if the process is
correctly executed the product follows. A great many tasks are of the third
type where the process appears to have been correctly carried out but the
product was not attained. Osborn offers two reasons why this can happen:
either 1) we were unable to specify fully the necessary and sufficient steps
in task performance, or 2) because we do not or cannot accurately measure
them. An example of aim-firing a rifle is given as an illustration that there
is no guarantee of acceptable marksmanship even if all procedures are
followed. In this case, process measurement would not adequately substitute
for product measurement. For tasks of the first two types, Osborn concludes
that it really doesn't matter which measure is used to assess proficiency,
but for tasks of the third type, product measurement is indicated. Osborn,
however, discusses a number of type 3 tasks where product measurement is
impractical because of cost, danger, or practicality. In these cases process
measures would come to be substituted with resulting injury to the validity
of the measure. Osborn poses a salient question that the test developer must
answer: If I use only a process measure to test a man's achievement on a
task, how certain can I be from this process score that he would also be
able to achieve the product or outcome of the task? Osborn holds that where
the degree of certainty is substantially less than that to be expected by
errors of measurement, the test developer should pause and reconsider ways
in which time and resources could be compromised in achieving at least an
approximation to product measurement. Osborn concludes by noting: The
accomplishment of product measurement is not always a simple matter; but it
is a demanding and essential goal to be pursued by the performance test
developer if his products are to be relevant to real world behavior. Swezey
has also addressed process versus product measurement, and assist versus
non-interference methods of scoring in CRT development. Swezey has
recommended process measurement in addition to, or instead of, product
measurement when diagnostic information is desired, when additional scores
are needed on a particular task, and when there is no product at the end of
the process.
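
The three task classes and the recommendation attached to each can be
summarized as a small decision rule. The sketch below is an interpretation
of the discussion above, not a procedure given by either author; the enum
names and the wording of the returned advice are hypothetical.

```python
from enum import Enum


class TaskType(Enum):
    PRODUCT_IS_PROCESS = 1       # e.g., gymnastic exercises
    PRODUCT_ALWAYS_FOLLOWS = 2   # fixed procedure tasks
    PRODUCT_MAY_FOLLOW = 3       # e.g., firing a rifle


def recommend_measure(task_type, product_measurable=True):
    """Suggest a kind of measure for a task of the given class."""
    if task_type in (TaskType.PRODUCT_IS_PROCESS,
                     TaskType.PRODUCT_ALWAYS_FOLLOWS):
        return "either"
    if product_measurable:
        return "product"
    # Product measurement ruled out by cost, danger, or practicality.
    return "process, with validity at risk"


print(recommend_measure(TaskType.PRODUCT_MAY_FOLLOW))
print(recommend_measure(TaskType.PRODUCT_MAY_FOLLOW, product_measurable=False))
```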



An issue which must be faced when constructing a complex CRT is the
bandwidth-fidelity problem (Cronbach and Gleser, 1965), i.e., the question
of whether to obtain precise information about a small number of
competencies or less precise information about a larger number. Hambleton
and Novick (1971) conclude that the problem of how to fix the length of each
sub-scale to maximize the percentage of correct decisions on the basis of
test results has yet to be resolved or even satisfactorily defined.

ISSUES RELATED TO CRT CONSTRUCTION 

Although construction methodology for NRT is well established and
highly specified, the construction of CRT has been much more of an art.
There have been, however, several attempts to formalize the construction
of CRT. Ebel (1962) describes the development of a criterion-referenced
test of knowledge of word meanings. Three steps were involved.

1. Specification of the universe to which generalization is desired.

2. A systematic plan for sampling from the universe.

3. A standardized method of item development.

These characteristics together serve to define the meaning of test scores.
To the extent that scores are reproducible on tests developed independently
under the same procedures, the scores may be said to have inherent meaning.
Flanagan indicates that a variant of Ebel's procedure was used in
Project TALENT. The tests used in the areas of spelling, vocabulary, and
reading were not based on specific objectives. They were, however, developed
by systematically sampling a relevant domain. Fremer and Anastasio
also put forth a method for systematically generating spelling items from
a specified domain.

Osburn (1968) notes two conditions as prerequisites for allowing
inferences to be made about a domain of knowledge from performance on a
collection of items.

1. All items that could possibly appear on a test should be specified
in advance.

2. The items in a particular test should be selected by random
sampling from the content universe.

It is rarely feasible to satisfy the first condition in any complete
fashion for complex behavior domains. However, the problem of testing all
items can be overcome at least in a highly specified content area by the
use of an item form (Hively, Patterson, and Page, 1968; Osburn, 1968).
The item form generally has the following characteristics (Osburn, 1968):

1. It generates items with a fixed syntactical structure.

2. It contains one or more variable elements.

3. It defines a class of item sentences by specifying the replacement
sets for the variable elements.

Shoemaker and Osburn (1969) describe a computer program capable of
generating both random and stratified random parallel tests from a
well-defined and rule-bound population. However, generalizing these results
to other domains has led to the finding that the difficulty of objectively
defining a test construction process is directly related to the complexity
of the behavior the test is designed to assess (Jackson, 1970). Where the
domain is easily specified, as in spelling, the construction process is
simplified.
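
The idea of an item form with replacement sets, combined with stratified
random sampling of parallel tests, can be sketched for a simple arithmetic
domain. This is an illustration only and does not reproduce the program
described above; the item form, the replacement sets, the stratification by
carrying, and all names are hypothetical.

```python
import random

# Fixed syntactical structure with two variable elements.
ITEM_FORM = "{a} + {b} = ?"


def build_domain():
    """Every item the form can generate (replacement sets: 10 through 99)."""
    return [(a, b) for a in range(10, 100) for b in range(10, 100)]


def needs_carry(a, b):
    """Stratum: does adding the ones digits require carrying?"""
    return (a % 10) + (b % 10) >= 10


def stratified_sample(domain, per_stratum, rng):
    """Draw the same number of items at random from each stratum."""
    strata = {False: [], True: []}
    for a, b in domain:
        strata[needs_carry(a, b)].append((a, b))
    sample = []
    for key in (False, True):
        sample.extend(rng.sample(strata[key], per_stratum))
    return sample


def render(sample):
    """Turn sampled element values into item sentences."""
    return [ITEM_FORM.format(a=a, b=b) for a, b in sample]


domain = build_domain()
form_a = stratified_sample(domain, 5, random.Random(1))
form_b = stratified_sample(domain, 5, random.Random(2))
print(render(form_a))
```

Because every item is specified in advance by the form and each test is
drawn at random within strata, two forms built this way are parallel by
construction rather than by judgement.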

It appears that at the current state-of-the-art, it is difficult to
develop the objective procedures necessary for criterion-referenced
measurement of complex behavior without doing violence to measurement
objectives. What is needed for complex content domains are item generating
rules that permit generalizations of practical significance to be made.

Jackson (1970) concludes, "For complex behavior domains, it appears that
at least until explicit models stated in measurable terms are developed, a
degree of subjectivity in test construction (and attendant
population-referenced scaling) will be required." The best approach appears
to be the use of a detailed test specification which relates test item
development processes to behavior.

Edgerton (1974) has suggested that the relationships among instructional
methods, course content and item format have not been adequately explored.
Item format should require thinking and/or performing in the patterns sought
by the instructional methods. If the instruction is aimed at problem solving,
then the items should address problem solving tasks and not, for example,
knowledge about the required background content. Edgerton feels that if one
mixes styles of items in the same test, one runs the risk of measuring
"test taking skill" instead of subject matter competence.

In a practical application, Osborn (1975) suggests fourteen steps in the
course of developing a test for training evaluation. The first three steps
have to do with assembling information concerning the skills and knowledge
segments, the relative importance of each objective, and the completeness of
each objective. In step 4 the developer should obtain clarification
concerning the meaning of confusing elements. Osborn points out that
performance standards are generally a source of trouble. Steps 5 through 8
concern themselves with developing the test items and answering questions of
the feasibility of simulation as well as questions of controlled
administration. In step 9 a final aspect of measurement reliability is
considered. Here procedures for translating observed performance into a
pass-fail score must be developed. Unfortunately, Osborn does not tell us how
to develop pass-fail criteria that will generalize to trainees' performance
in the field. In step 10 a supplementary scoring procedure is developed for
diagnosing reasons for trainee failure. Osborn does not say if this is to be
a criterion- or norm-referenced interpretation. In step 11 the developer
formats the final item with its instruction, scoring procedures, etc. In
step 12 a decision is made as to whether time permits testing on all
objectives or if a sample should be used. Step 13 covers sampling procedures
based on the criticality of the behavior. In step 14 guidance for test
administration is prepared. Osborn has provided the developer of CRTs with a
broad outline of the steps to be taken in item development. Unfortunately,
he does not provide much detail on how various decisions are to be made,
i.e., what are passing scores, how to simulate, etc. It is the quality of
these decisions that determines the usefulness of the final instrument but
the decision-making process apparently remains an art.

MASTERY LEARNING 

Besel (1973a,b) contends that norm-group performance is useful and
legitimate information for the construction and application of CRT. Besel
defines a CRT as a set of items sampled from a domain which has been judged
to be an adequate representation of an instructional objective. The domain
should be fully described so as to allow two test developers to independently
generate equivalent items which measure the same content and are equally
reliable. A degree of arbitrariness creeps in when a mastery level is
specified for a given objective or set of objectives. Besel recommends the
"Mastery Learning Test Model" to provide an appropriate algorithm to support
mastery/non-mastery decisions. Two statistics are computed: the probability
that a student has indeed achieved the objective and the proportion of a
group which has achieved the objective. The model assumes that each student
can be treated as either having achieved the objective or not having
achieved the objective, with no partial achievement possible. The Mastery
Learning Test Model and its underlying true score theory is related to a
notion enunciated by Emrick (1971). Emrick assumed that measurement error
was attributable to two sources: α, the probability that a non-master will
correctly answer an item ("false positive") and β, the probability that a
master will give an incorrect answer to an item ("false negative"). These
constructs resemble the Type I and Type II errors encountered in discussions
of statistical inference. Emrick's model assumes that all item difficulties
and inter-item correlations are equal, a difficult assumption in view of the
assumed variability of the former as a result of instruction and the
difficulties in computing the latter. Besel (1973a,b) has developed
algorithms for estimating α and β. Three data sources are used:

1. Item difficulties

2. Inter-item covariance

3. Score histograms

In a tryout, Besel reports "that the usage of an independent estimate
of the proportion of students reaching mastery resulted in improved stability
of Mastery Learning parameters." This improved stability of α and β should
promote increased confidence in mastery/non-mastery decisions. Besel's
computational procedures are, however, quite involved, using a multiple
regression approach which requires independent a priori estimates of variance
due to conditions. Besel also points out that β is estimated best for a
group when the mastery level is lowered while the reverse is true for α. In
other words, Besel has empirically established a relationship between errors
of misclassification and criterion level. A decision, however, has not been
made concerning the relative cost/effectiveness of the competing errors of
misclassification. These decisions may have to be made individually for
each instructional situation.
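
The role of the two item-level error probabilities can be made concrete
with a minimal sketch. It assumes, in the spirit of the equal-item
assumption noted above, that every item has the same probability of being
answered correctly by a master (1 - beta) and by a non-master (alpha), and
that items are independent. The binomial form, the prior proportion of
masters, the parameter values, and all names are assumptions for
illustration; the estimation algorithms referred to above are not
reproduced.

```python
from math import comb


def likelihood(k, n, p_correct):
    """Binomial probability of k correct answers out of n items."""
    return comb(n, k) * p_correct ** k * (1 - p_correct) ** (n - k)


def posterior_mastery(k, n, alpha, beta, prior):
    """Probability that a student is a master, given k of n items correct.

    alpha: chance a non-master answers an item correctly.
    beta:  chance a master answers an item incorrectly.
    prior: assumed proportion of the group that has reached mastery.
    """
    master = prior * likelihood(k, n, 1 - beta)
    non_master = (1 - prior) * likelihood(k, n, alpha)
    return master / (master + non_master)


def misclassification(n, cutoff, alpha, beta):
    """(P(non-master passes), P(master fails)) when passing requires
    at least `cutoff` correct answers out of n."""
    false_pos = sum(likelihood(k, n, alpha) for k in range(cutoff, n + 1))
    false_neg = sum(likelihood(k, n, 1 - beta) for k in range(cutoff))
    return false_pos, false_neg


# Hypothetical parameter values for a five-item test.
high = posterior_mastery(5, 5, alpha=0.2, beta=0.1, prior=0.5)
low = posterior_mastery(1, 5, alpha=0.2, beta=0.1, prior=0.5)
lenient = misclassification(5, 3, alpha=0.2, beta=0.1)
strict = misclassification(5, 4, alpha=0.2, beta=0.1)
print(high, low, lenient, strict)
```

Under these assumptions, lowering the criterion level raises the chance of
passing a non-master and lowers the chance of failing a master, which is the
tradeoff between the two errors of misclassification discussed above.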

ESTABLISHING AND CLASSIFYING INSTRUCTIONAL OBJECTIVES

The development of student performance objectives for instructional
programs has become a widespread and well-understood process throughout
the educational community. For quality control of the conventional process,
crucial information derives directly from instructional objectives; they
provide not only the specifications for instruction, but also the basis for
evaluating instruction (Lyons, 1972). Ammerman and Melching (1966) traced
the interest in behaviorally stated objectives from three independent
movements within education. The first derives from the work of Tyler (1934,
1950, 1964) and his associates who worked for over 55 years at specifying
the goals of education in terms of what would be meaningful and useful to
the classroom teacher. Tyler's work has had considerable impact in the trend
toward describing objectives in terms of instructional outcomes.

The second development has come from the need to specify man-machine
interaction in modern defense equipment. Miller (1962) was responsible for
pioneering efforts in developing methods for describing and analyzing job
tasks. Chenzoff (1964) reviewed the then extant methods in detail and many
more have appeared since that date. More recently Davies classified
task analysis schemes into six categories:

1. Task analysis based upon objectives, which involves analysis of a
task in terms of the behaviors required, i.e., knowledge, comprehension, etc.

2. Task analysis based upon behavioral analysis, i.e., chains, concepts,
etc.

3. Task analysis based on information processing needs for performance,
i.e., indicators, uses, etc.

4. Task analysis based on a decision paradigm which emphasizes the
judgement and decision-making rationale of the task.

5. Task analysis based upon subject matter structure of a task.

6. Task analysis based upon vocational schematics, which involve analysis
of jobs, duties, tasks and task elements.

The point of Davies' breakdown is that there is no one task analysis
procedure. The general approach is to "gin up" a new task analysis scheme or
modify an existing scheme to suit the needs of the job at hand.



The third development was the concept of programmed instruction which
required the writers of programs to acquire specific information in
instructional objectives.

It is apparent that these initial phases of development have largely
merged, and the use of instructional objectives has become accepted
educational practice. A critical event in this fusion was the publication of
Mager's (1962) little book Preparing Instructional Objectives. In this
work, Mager set forth the requirements for the form of a useful objective but
he did not deal with the procedures by which one could obtain the information
to support preparation of the objectives. A series of additional works
including one on measuring instructional intent (Mager, 1973) have dealt more
thoroughly with such issues.

Information as to the actual behaviors exhibited by an acceptable
performer is preferred as the basis for the construction of an instructional
objective. However, data can come from a variety of sources, such as:

1. Supervisor interview

2. Job incumbent interview

3. Observation of performer

4. Inferences based on system operation

5. Analysis of "real world" use of instruction

6. Instructor interview

The methods used to derive this data are legion and have become very
clever and sophisticated. Flanagan's "critical incident technique"
and the various modifications and off-shoots it has inspired is a good
example of an effort aimed at identifying essential performance while
eliminating information not directly related to the successful
accomplishment of a job-related task.

The choice of method for deriving job behavior instruction must be based
on the type of performance and various realistic factors such as the
accessibility of the performance to direct observation. Generally the
solution is less than ideal, but techniques such as Ammerman and Melching's
(1966) can be used to review the objectives so derived and provide a
useful critique of the data collection method. An exhaustive review of the
various techniques for deriving instructional objectives is impossible here.
The reader is directed to Lindvall and Smith for a comprehensive
treatment of this question.

Ammerman and Melching (1966) have developed a system for the analysis
and classification of terminal performance objectives. Ammerman and Melching
examined a great number of objectives generated by different agencies and
concluded that five factors accounted for the significant ways in which most
existing performance objectives differed. These factors are:

1. Type of performance unit

2. Extent of action description

3. Relevancy of student action

4. Completeness of structural components

5. Precision of each structural component

Further, Ammerman and Melching have identified a number of levels under
each factor. For instance, factor #1 has three levels from specific task,
which involves one well-defined particular activity in a specific work
situation, to generalized behavior, which refers to a general measure of
performance or way of behaving, such as the work ethic.

With these five factors and the identification of levels for each
factor, it is possible to classify or code any terminal objective by a five
digit number. This scheme has high value for management control and review
of terminal performance objectives. Ammerman and Melching feel the method
can fulfill three main purposes:

1. Provision of guidance for the derivation of objectives and
standardization of statements of objectives so that all may meet the
criteria of explicitness, relevance, and clarity.

2. Evaluating the proportion of objectives dealing with specific or
generalized action situations.

3. Evaluating the worth of a particular method for deriving objectives.
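
The five-digit coding idea can be sketched as follows. Only the five factor
names come from the text above; the assumption that each level is a digit
from 1 to 9, the convention that level 1 on the first factor means a
specific task, the sample codes, and all function names are hypothetical.

```python
FACTORS = (
    "type of performance unit",
    "extent of action description",
    "relevancy of student action",
    "completeness of structural components",
    "precision of each structural component",
)


def encode(levels):
    """Combine one level per factor into a five-digit code string."""
    if len(levels) != len(FACTORS):
        raise ValueError("one level is required for each of the five factors")
    if any(not 1 <= level <= 9 for level in levels):
        raise ValueError("levels are assumed to be single digits 1-9")
    return "".join(str(level) for level in levels)


def decode(code):
    """Map a five-digit code back to its factor levels."""
    return dict(zip(FACTORS, (int(digit) for digit in code)))


def proportion_specific(codes):
    """Share of objectives coded as specific tasks on the first factor
    (level 1 is assumed to mean a specific task)."""
    return sum(1 for code in codes if code[0] == "1") / len(codes)


codes = [encode([1, 2, 3, 1, 2]), encode([3, 1, 2, 2, 1]),
         encode([1, 1, 1, 1, 1]), encode([2, 2, 2, 3, 3])]
print(codes, proportion_specific(codes))
```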

This is an extremely useful method, particularly where a panel of judges 
is used to review each objective* A coefficient of congruence can be 
computed between the judge placement of the objective on the five dimen- 
sions to yield a relative index of agreement* Used in this fashion, the 
Ammerman and Melching method should prove to be very useful in development 
of instructional systems. 
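
A rough picture of how agreement among judges might be summarized is given
below. The sketch does not compute a coefficient of congruence; it
substitutes a plainer statistic, the proportion of judge pairs that assign
the same level on each of the five dimensions. The codes and names are
hypothetical.

```python
from itertools import combinations


def pairwise_agreement(codes):
    """For each of the five digits, the proportion of judge pairs that
    placed the objective at the same level."""
    pairs = list(combinations(codes, 2))
    agreement = []
    for position in range(5):
        same = sum(1 for a, b in pairs if a[position] == b[position])
        agreement.append(same / len(pairs))
    return agreement


# Hypothetical five-digit codes assigned by three judges to one objective.
judge_codes = ["12312", "12322", "12312"]
agreement = pairwise_agreement(judge_codes)
print(agreement)
```

Low agreement on a single dimension, as on the fourth digit here, would
point to the part of the objective statement that needs to be tightened.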



DEVELOPING TEST MATERIALS AND ITEM SAMPLING 

Hively and his associates provide a useful scheme for writing items
which are congruent with a criterion. Hively's effort has been
in the area of domain-referenced achievement testing. In Hively's system,
an item form constitutes a complete set of rules for generating a domain of
test items which are accurate measures of an objective. Popham (1970) points
out that this approach has met with success where the content area has
well-defined limits. In areas such as mathematics, independent judges tend
to agree on whether a given item is congruent with the highly specific
behavior domain referenced by the item form. As less well-defined fields are
approached, however, it becomes very difficult to prepare item forms so
that they yield test items which can be subsequently judged congruent with
a given instructional objective. Easy interjudge agreement tends to fade
and the items become progressively more cumbersome. Popham (1970) remarks:
"Perhaps the best approach to developing adequate criterion-referenced test
items will be to sharpen our skill in developing item forms which are
parsimonious but also permit the production of high congruency test items."

Cronbach (1963, 1972) presents a generalizability theoretic approach
to achievement testing. Cronbach's theory presents a mathematical model in
the framework of which an achievement test is assumed to be a sample from a
large well-defined domain of items. Parallel test forms are obtained by
repeated sampling according to a plan. Analysis of variance techniques
(particularly intra-class correlation) are used to obtain estimates of
components of variance due to sampling error, testing conditions, and other
sources which may affect the reliability of the score. It should be pointed
out that analysis of variance, when used in this fashion, is essentially a
non-parametric technique particularly suitable for use with CRTs.
Generalizability theory has been extended (Osburn, 1968) by including the
concepts of task analysis which allows sorting subject matter into
well-defined behavioral classes. Osburn (1968) has termed this convergence
"Universe-defined achievement testing". Hively et al. (1968, 1973) have used
these techniques in an exploration of the mathematics curriculum.
Mathematics represents a subject domain particularly suited to this approach
and Hively reported success as evidenced by the high intra-class
correlations between sets of items sampled from a universe of items. If
applicable to less well-defined content domains, this technique promises to
have diagnostic utility and also particular relevance to examining the form
of relationships between knowledges and skills. As yet, this extension into
other subjects has not been undertaken.
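
The kind of intra-class correlation referred to above can be illustrated
with a minimal one-way analysis of variance sketch. It assumes a table in
which each row is an examinee and each column is the score on one randomly
sampled set of items; the one-way, single-measure form of the coefficient is
an assumption chosen for simplicity, and the data are hypothetical.

```python
def intraclass_correlation(scores):
    """One-way, single-measure intra-class correlation.

    scores: list of rows, one per examinee, each holding that examinee's
    score on every sampled item set.
    """
    n = len(scores)          # examinees
    k = len(scores[0])       # item sets per examinee
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(scores, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scores of four examinees on two sampled item sets.
scores = [[8, 9], [5, 6], [2, 2], [9, 9]]
icc = intraclass_correlation(scores)
print(round(icc, 3))
```

A value close to 1 indicates that item sets drawn from the same universe
rank and place examinees in nearly the same way.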



QUALITY ASSURANCE

Hanson and Berger (1971) view quality assurance as a means for
maintaining desired performance levels during the operational use of a
large-scale instructional program. These workers identify six major
components in a Quality Assurance program:

1. Specification of indicator variables. These are variables which
measure the important attributes of aspects of a program and must be
individually defined for each instructional system.

Examples given are:

a. Pacing--measure of instructional time

b. Performance--interim measures of learning, i.e., unit tests,
module tests, etc.

c. Logistics--indicator reports of failure to deliver materials,
etc.

2. Definition of decision rules. The emphasis here should be on
indicators which signal a major program failure. Critical levels may be
determined on the basis of evidence from developmental work or on the basis
of an analysis of program needs.

3. Sampling procedures. These questions must be answered on the basis
of an analysis of the severity of effects if sufficient information is
available. Factors to be considered include:

a. Number of program participants to provide data

b. How to allocate sampling units

c. Amount of information from each participant

4. Collecting quality assurance data. Special problems here concern
the willingness of participants to cooperate in the data gathering effort.
Data must be timely and complete. Hanson and Berger suggest a number of
ways to reduce data collection problems:

a. Minimize the burden on each participant by collecting only
required data.

b. Use thoroughly designed forms and simplified collection
procedures.

c. Include indicators which can be gathered routinely without
special effort.

5. Analysis and summarization of data. Some data may be analyzed as it
comes in; other data may have to be compiled for later analysis. The exact
technique will depend on the type of decision the data must support.

6. Specification of actions to be taken. This step must describe the
actions to be taken in the event of major program failure. Alternatives
should be generated and scaled to the severity of the failures. Information
as to actions taken to correct program failures should always be fed back
into the program development cycle. This feedback will be an important source
of information to guide program revision.

Hanson and Berger offer an illustrative example of how this process might
be implemented. They conclude by noting that quality assurance, as applied
to criterion-referenced programs, would act to ensure that the specified
performance levels will be maintained through the life of a program. These
notions provide the basis of an important concept in the implementation of
an instructional program utilizing criterion-referenced measurement. If this
sort of internal quality assurance program is built into the instruction, then
the probability of an instructional program becoming "derailed" while up and
functioning is certainly minimized.
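
As a rough illustration of how components 1, 2, and 6 fit together, the
sketch below pairs each indicator variable with a critical level and a
corrective action. The indicator names, critical levels, and actions are
hypothetical; they are not taken from Hanson and Berger and only show the
general shape such a decision rule might take.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    """One indicator variable with its decision rule (all values hypothetical)."""
    name: str
    critical_level: float
    higher_is_worse: bool   # True if values ABOVE the critical level signal failure
    action: str             # action to take when the rule is triggered


def check_indicators(indicators, observed):
    """Return (name, action) for every indicator past its critical level."""
    flagged = []
    for ind in indicators:
        value = observed.get(ind.name)
        if value is None:
            continue  # no data collected for this indicator in this cycle
        if ind.higher_is_worse:
            failed = value > ind.critical_level
        else:
            failed = value < ind.critical_level
        if failed:
            flagged.append((ind.name, ind.action))
    return flagged


# Hypothetical pacing, performance, and logistics indicators.
INDICATORS = [
    Indicator("pacing_hours_per_unit", 12.0, True, "review unit length and schedule"),
    Indicator("unit_test_pass_rate", 0.80, False, "revise instruction for the unit"),
    Indicator("late_material_reports", 3, True, "escalate to logistics"),
]

observed = {
    "pacing_hours_per_unit": 10.5,
    "unit_test_pass_rate": 0.72,
    "late_material_reports": 1,
}

flags = check_indicators(INDICATORS, observed)
for name, action in flags:
    print(f"{name}: {action}")
```

Whatever is flagged here would then be fed back into the program
development cycle, as component 6 requires.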



DESIGNING FOR EVALUATION AND DIAGNOSIS

Baker (1974) feels that the critical factor in instruction is not how
the test results are portrayed (NRT or CRT) but how they are obtained and
what they represent. Baker suggests the term construct-referenced to
describe achievement tests consisting of a wide variety of item types and
well-sampled content range. These tests are results of the norm-referenced
type. Criterion-referenced tests, Baker feels, are probably better termed
domain-referenced tests (see discussion of Hively et al., 1968, 1975). A
domain specifies both the performance the learner is to demonstrate and
the content domain to which the performance is to generalize. Another
subset of CRT is what Baker refers to as the objective-referenced test. The
objective-referenced test starts with an objective based on observable
behavior from which it is possible to produce items which are homogeneous
yet relate to the objective. Baker feels the notion of domain-referenced
tests is more useful.

Each type of test will provide different information to guide improve-
ment of instructional systems. Construct-referenced tests will provide
information regarding a full range of content and behavior relevant to a
particular construct. The objective-referenced test will provide items
which exhibit similar response requirements relating to a vaguely defined
content area. The domain-referenced test will include items which conform
to a particular response segment, as well as to a class of content to which
the performance is presumed to generalize.

Baker (1974) then proposes a minimum set of data needed to implement an
instructional improvement cycle:

1. Data on applicable student abilities

2. Ability to identify deficiencies in student achievement

3. Ability to identify possible explanation for deficiencies

4. Ability to identify alternative remedial sequences

5. Ability to implement sequence

All three types of tests provide data useful for set 1. Construct-
referenced tests are probably the most readily available, but are not
administered on a cycle compatible with diagnosis and are reported in a
nomothetic manner. A well-designed objective-referenced test may be sched-
uled in a more useful fashion. A domain-referenced test provides enabling
information to allow instructors to identify what the students were able
to deal with. Identification of performance deficiencies (set 2) is
theoretically possible with all three sets of data. However, since cut-offs
are usually arbitrary, none of the three tests will give adequate information.

As for sets 3, 4, and 5, there is little in the way of information
yielded by any of the three tests which would aid in these decisions.
In addition, training research is not yet well-advanced in these areas, nor
does the information always reach the user level. In addition, incentives
are lacking since most accountability programs are used to punish defi-
ciency rather than to promote efficiency. Of the three test types, the
domain-referenced tests give program developers the most assistance, for
they are provided with clear information about what kind of practice items
are in the area of content and performance measured by the test. Also,
students may practice on a particular content domain without contacting
the test items themselves. However, Baker points out, domain-referenced
items are hard to prepare, mainly because not all content areas are analyzed
in a fashion to allow specification of the behaviors in the domain, as has
been noted elsewhere.



ESTABLISHING PASSING SCORES 

Prager, Mann, Burger, and Cross (1972) discuss the cut-off point issue 
and point out that there are two general routes to travel. The first method
involves setting an arbitrary overall mastery level. The trainee either
attains at least criterion or not. A second procedure is that of requiring 
all trainees to attain the same mastery level in a given objective but to 
vary the levels from objective to objective, depending on the difficulty 
of the material, importance of the method for later successful performance, 
etc. This second method seems more reflective of reality, but as Prager et al.
(1972) point out, it is certainly more difficult to implement, let alone
justify, specific levels that have been decided upon. Prager et al. believe
that for handicapped children, at least, it would be appropriate to set
mastery levels for each child relative to his potential. Nitko (1971) concurs
and suggests different cut-offs for different individuals. However, the
feasibility of individual cut-offs seems doubtful. Lyons (1972) points out
that standards must take into account the varying criticality of the tasks. 
The criticality for any task is basically an assessment of the effect on an 
operating system of the incorrect performance on that task. Criticality 
must be determined during the task analysis and must be incorporated into 
the training objective. Unfortunately, in most cases the criticality of a
task is not an absolute judgement and the selection of a metric for criti- 
cality becomes somewhat arbitrary. 
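
The difference between the two routes can be made concrete with a small
sketch. The objectives, required levels, and trainee scores below are
hypothetical, chosen only so that the more critical task carries the higher
required level; nothing here is prescribed by Prager et al. or by Lyons.

```python
def meets_overall_level(scores, overall_level):
    """First route: one arbitrary overall mastery level across all objectives."""
    return sum(scores.values()) / len(scores) >= overall_level


def mastery_by_objective(scores, required_levels):
    """Second route: a separate required level for each objective."""
    return {
        objective: scores.get(objective, 0.0) >= level
        for objective, level in required_levels.items()
    }


# Hypothetical required levels, scaled to the assumed criticality of each task.
required_levels = {"clear_weapon": 1.00, "read_map": 0.80, "fill_out_form": 0.60}

# Proportion of items answered or performed correctly on each objective.
trainee_scores = {"clear_weapon": 0.90, "read_map": 0.85, "fill_out_form": 0.70}

overall_pass = meets_overall_level(trainee_scores, 0.80)
per_objective = mastery_by_objective(trainee_scores, required_levels)
objective_pass = all(per_objective.values())

print(overall_pass, per_objective, objective_pass)
```

With these invented figures the trainee clears an overall 80 percent level
but falls short on the most critical objective, which is the kind of case
the second route is meant to catch.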

The approach to reliability advocated by Livingston (1972) holds some 
promise for determining pass-fail scores. If Livingston's assumptions are
accepted, then it becomes possible to obtain increased measurement reliabil-
ity by varying the criterion score. If the criterion score is set so that
a high or very low proportion pass, then we will obtain reliable measurement.
Unfortunately, it is not often possible to "play around" with criterion scores
to this extent. The training system may require a certain number passing
and the criterion score is usually adjusted to provide the required number.

From this discussion it is apparent that there are no completely
generalizable rules to guide the setting of cut-off scores. The cut-off
must be realistic to allow the training system to provide a sufficient
amount of trained manpower at some realistic level of competence.
Training developers setting the cut-off score must therefore consider the
abilities of the trainee population, the through-put requirements of the
training system, and the minimum competence requirement, and act accordingly.
The use of summative try-out information should allow a realistic solution
to the cut-off question for specific applications.
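
The effect attributed to Livingston above can be seen by computing his
criterion-referenced reliability coefficient at several criterion scores.
The formula in the sketch is the form in which the coefficient is usually
stated; it is not reproduced in this report, and the test statistics are
invented for illustration.

```python
def livingston_k2(reliability, mean, variance, criterion):
    """Livingston's criterion-referenced reliability coefficient.

    k^2 = (r * var + (mean - C)^2) / (var + (mean - C)^2)

    reliability: conventional (norm-referenced) reliability r of the test
    mean, variance: mean and variance of the observed scores
    criterion: the criterion (cut-off) score C
    """
    shift = (mean - criterion) ** 2
    return (reliability * variance + shift) / (variance + shift)


# Hypothetical test: conventional reliability .60, mean 70, variance 100.
for criterion in (70, 60, 50):
    k2 = livingston_k2(0.60, 70.0, 100.0, criterion)
    print(f"criterion {criterion}: k^2 = {k2:.2f}")
```

When the criterion equals the mean, the coefficient reduces to the
conventional reliability; it rises as the criterion moves away from the
mean, which is the sense in which extreme pass rates yield "reliable"
measurement.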



USES OF CRT IN NON-MILITARY EDUCATION SYSTEMS 

Prager et al. (1972) describe research on one of the first CRT systems
(Individual Achievement Monitoring System - IAMS) designed for the handi-
capped and designed for widespread implementation. Prager et al. point out
that standardized tests often are useless when applied to handicapped 
individuals. They are simply too global in nature to be of much use in 
directing remediation. Tests built to reflect specific instructional
objectives are much more useful when dealing with such populations. The
use of CRTs also allows relating a handicapped child's progress to criterion
tasks and competency levels. The use of CRTs is further indicated by the 
need for individualized instruction and individualized testing when dealing 
with individuals who exhibit a variety of perceptual and motor deficiencies. 
As a result of these considerations, a CRT-centered accountability system 
has been devised. This project began with the construction of a bank of 
objectives and test items to mesh with the type of diagnostic individual- 
ization peculiar to the education of the mentally handicapped. To meet 
these needs, the objectives were, of necessity, highly specified. The 
CRT-guided instructional system was geared to yield information to support 
three types of decision: placement, immediate achievement, and retention.
Standardized diagnostic and achievement tests were also used to aid in place-
ment decision. The system is still in the early stages of implementation so
no comment can be made concerning its ultimate usefulness. 

More recently, Popham (lyY>) presents considerable data concerning the 
use of teacher performance tests. These tests require a teacher to develop
a "mini-lesson" from an explicit instructional objective. After planning
the lesson, the teacher instructs a small group of learners for a small
period of time. At the conclusion of the "mini-lesson", the learners are
given a post-test. Affective information is derived by asking the learners
to rate the interest value of the lesson. Popham reviews three potential 
applications of the teacher performance test: 

1. A focusing mechanism. To provide a mechanism to focus the teachers' 
attention on the effects of instruction, not on "gee-whiz" methods. 

2. A setting for testing the value of instructional tactics. The
teacher performance test can be used as a "test bed" to evaluate the
differential effectiveness of various instructional techniques. The teacher
need not be the instructor, but the important aspect of this application
involves a post-lesson analysis in which the instructional approach is
appraised in terms of its effects on learners.

3. A formative or summative evaluation device. Popham views this
application of teacher performance tests to program evaluation to be
extremely important, particularly in the appraisal of in-service and pre-
service teacher education programs.

Popham presents three in-service and pre-service applications of the 
teacher performance tests. These applications were for the most part
viewed as effective. However, a number of problems were revealed in the
course of these applications that may be symptomatic of performance tests in
general. Popham found that unless skilled supervisors were used in the
conduct of the mini-lesson, most of the advantages of the post-lesson analysis
were lost. Popham also found that visible dividends were gained by the use of
supplemental normative information to give the teacher and the evaluator a
bit more information regarding the adequacy of performance. In a similar area
of endeavor, Baker (1975) reports the use of a teacher performance test as
a dependent measure in the evaluation of instructional techniques. Baker
discussed some shortcomings of the use of CRTs as dependent variables. These
shortcomings are largely based on the peculiar psychometric properties of
CRTs. However, Baker feels that CRT is valuable for research purposes even
with the large number of unanswered questions concerning their reliability
and validity. Baker points out ". . . if the tests have imperfect reliability
coefficients in light of imperfect methodology, the researcher is compelled
to report the data, qualify one's conclusions, and encourage replication."
Baker also feels the use of teacher performance tests with the indeterminate
psychometric characteristics is not ethically permissible for evaluation of 
individuals--at least for the present. 

In a slightly different area of application, Knipe (1975) summarizes the
experience of the Grand Forks Learning System in which CRTs played a very
salient part. The Grand Forks School District began by specifying in detail
the performance objectives for most subject areas. These objectives
were to form the basis of a comprehensive set of teacher/learner contracts
as one instructional method by which students could meet the objectives. It
was found that mathematics was the subject area most amenable to analysis and
therefore received the most extensive treatment. The mathematics test
consisted of approximately 120 criterion-keyed items for each grade level.
After extensive tryout the items were revised on the basis of teacher and 
student recommendations as well as on the basis of a psychometric analysis. 
The inclusion of psychometric analysis as a device to direct the revision 
of items seems questionable in view of the limited variance of CRTs. In 
summary, however, the teachers regarded the CRTs as useful in supplementing 
NRTs, and in addition found them useful for placement. Finally, Knipe
concludes, "The criterion-reference test is the only type of test that a
school district can use to determine if it is working toward its curriculum
goals."

Extensive experience with use of CRT was reported by Taylor, Michaels,
and Brennan (1975) in connection with the Experimental Volunteer Army
Training Program (EVATP). To standardize EVATP instruction, reviews, and
testing, performance tests covering a wide variety of content were
developed and distributed to instructors. The tests were revised as
experience accumulated; some tests were revised as many as three times.
Drill sergeants used the tests for review or remediation, while testing
personnel used them in the administration of the general subjects, comprehen-
sive performance and MOS tests. The tests also provided the basis for the
EVATP Quality Control System which was intended to check on skill acquisi-
tion and maintenance during the training process. Unfortunately, problems
were encountered with the change in role required of the instructors and
drill sergeants under the system of skill performance instruction and
training. Considerable effort was required to bring about the desired changes
in instructor role. The CRT-based quality control system performed its
function well by giving an early indication of problems in the new instructional
system. Evaluation of the performance-based system revealed clear-cut
superiority over the conventional instructional system. The problems with
institutional change encountered by these workers should be noted by anyone
proposing drastic innovation where a traditional instructional system is 
well-established. 

Pieper, Catrow, Swezey & Smith present a description of a performance
test devised to evaluate the effectiveness of an experimental training
course. The course was individualized, featuring an automated apprenticeship
instructional approach. Test item development for the course performance test
was based on an extensive task analysis. The task analysis included many
photographs of job incumbents performing various tasks. These photos served
as stimulus materials for the tests and were accompanied by questions requiring
"what would I do" responses or identification of correct vs. incorrect task
performance. All items were developed for audio-visual presentation, permit-
ting a high degree of control over testing conditions. Items were selected
which discriminated among several criteria. Internal consistency reliability
was also obtained. This effort is illustrative of good practice in CRT
development and shows cleverness in the use of visual stimuli--the statistical
treatments used in selecting items are, however, questionable. A somewhat
similar development project entitled Learner Centered Instruction (LCI) (Pieper
& Swezey) also describes a CRT development process. Here, a major effort was
devoted to using alternate form CRTs, not only for training evaluation, but also
for a field follow-up performance evaluation after trainees had been working in
field assignments for six months.

Air Force Pamphlet 50-58, the Handbook for Designers of Instructional
Systems, is a seven-volume document which includes a volume dealing with
CRTs. A job performance orientation to CRT is advocated. Specific guide-
lines for task analysis and for translating criterion objectives into test
items are presented in "hands-on performance" and in written contexts. The
document is an excellent guide to the basic "do's" and "don'ts" in CRT
construction. A similar Army document, TRADOC Regulation 350-100-1, Systems
Engineering of Training, presents guidelines for developing evaluation
materials and for quality control of training. CRTs are used interchange-
ably with "performance tests" and with "achievement tests" in this document.
The areas of CRT in particular and of evaluation in general are given
minimal coverage. CON Pam 350-11 is essentially a revision of TRADOC
Regulation 350-100-1, revised to be compatible with unit training requirements.
This document, although briefly mentioning testing and quality control,
presents virtually no discussion of CRT.

Various Army schools have developed manuals and guides for their own 
use in the area of systems engineering of training. The Army Infantry
school at Fort Benning, Georgia, for example, has published a series of
Training Management Digests as well as a Training Handbook and an Instructor's
Handbook. There also exist generalized guidelines for developing performance-
oriented test items in terms of memoranda to MOS test item writers and via
the contents of the TEC II program (Training Extension Course). The Field
Artillery school at Fort Sill, Oklahoma provides an Instructional Systems 
Development Course pamphlet as well as booklets on Preparation of Written 
Achievement Examinations and an Examination Policy and Procedures Guide in 
the gunnery department. The Armor school at Fort Knox, Kentucky, publishes 
an Operational Policies and Procedures guide to the systems engineering of 
training courses. Generally these documents provide a cursory coverage of
CRT development, if it is covered at all. 

The Army Wide Training Support group of the Air Defense school at Fort 
Bliss, Texas provides an interesting concept in evaluation of correspondence 
course development. Although correspondence course examinations are 
necessarily paper and pencil (albeit criterion-referenced to the extent
possible), many such courses contain an OJT supplement which is evaluated
via a performance test administered by a competent monitor in the field
where the correspondent is working. This is a laudable attempt to move
toward performance testing in correspondence course evaluation. A supple-
ment to TRADOC Reg 350-100-1 on developing evaluation instruments has also
been prepared here. This guide provides examples of development of evalua-
tion instruments in radar checkout and maintenance and in leadership areas.

A course entitled "Objectives for Instructional Programs" (Insgroup,
197?), which is used on a number of Army installations, has provided a dia-
grammatic guide to the development of instructional programs. CRT is not
covered specifically in this document, nor is it addressed in the recent
Army "state-of-the-art" report on instructional technology (Branson, Stone,
Hannum, and Raynor, 1975). However, a CISTRAIN (Coordinated Instructional
Systems Training) course (Deterline & Lenn, 1972a, b), which is also used
at Army installations for training instructional systems developers, does
deal with CRT development and, in fact, provides instructions for writing
items and for developing CRTs. The study guide (Deterline and Lenn, 1972b)
deals with topics such as developing criteria, identifying objectives,
selecting objectives via task analysis, developing baseline CRT items,
revising first draft items and preparing feedback. This document provides
a good discussion of CRT development in an overview fashion.



U.S. Army Field Manual 21-6 (20 January 1967) provides trainers and
instructors of U.S. Army in-service schools with guidance in the preparation
of traditional instruction, e.g., lectures, conferences, and demonstrations.
FM 21-6 (20 January 1967) contains a great deal of information on construc-
tion of achievement tests but the "why's" and "how's" are largely lacking.
The section on performance testing seems designed to discourage the construc-
tion and use of performance tests. In addition, the manual is weak on task
analysis procedures; procedures in general lack definition of method. All
testing concepts are directed at the construction of norm-referenced tests
of either job knowledge or performance. There is no discussion of how to
set cut-offs, or any discussion of the issues peculiar to CRT. The emphasis
is on relative achievement. Recently, FM 21-6 has undergone comprehensive
revision to suit the needs of field trainers. The revised manual (1 December
1975) is generally in tune with contemporary training emphasis, with consider-
able information on individualized training and team training. In particular,
the extensive guidance provided on objective generation should prove very
useful to field trainers. While the revised FM 21-6 does not specifically
refer to CRT, the obvious emphasis on NRT which distinguished the earlier
version is gone. A possible weakness in the revised version is the tacit
assumption that all trainees will reach the specified standard of perfor-
mance. Although the requirement that all trainees reach criterion is not by
itself unreasonable, practical constraints of time and cost sometimes
dictate modified standards, e.g., 80% reaching criterion. Where it is not
feasible to wash-out or to recycle trainees, then remediation must be designed
to permit an economical solution. FM 21-6 does not seem to address the
remediation problem. In general, though, FM 21-6 is a good working guide to
field training. It will be interesting to see how effective it is in the
hands of typical field training personnel.

From these limited examples it appears that the civilian sector has led
in the development and use of CRTs. Although the EVATP effort is a notable
exception, the use of CRTs in military operations has been slowed by the
high initial cost of developing criterion-referenced performance tests.
Often, the use of CRTs for performance assessment has required operational
equipment or interactive simulators, drastically raising costs. School systems
have had success with CRTs, largely due to the nature of the content domains
chosen. These content domains heavily emphasize knowledge; hence tests can
be paper and pencil, which are cheap to administer. A solution to the cost
problem may be found in the notion of Osborn (1970), who has devised an
approach to "synthetic performance tests" which may lead to lowered testing
costs, although little concrete evidence has appeared in the literature to
date.



INDIRECT APPROACH TO CRITERION-REFERENCING 

Fremer (1972) feels that it is meaningful to relate performance on
survey achievement tests to significant real-life criteria, such as minimal
competency in a basic skills area. This author discusses various ways of
relating survey test scores and criterion performance. All of these
approaches are aimed at criterion-referenced interpretation of test scores.
Fremer proposes that direct criterion-referenced inferences about an exam-
inee's abilities need not be restricted to tests that are composed of
actual samples of the behavior of interest. Fremer feels that considerable
use can be made of the relationships observed among apparently diverse
tasks within global content areas. Fremer further argues that tasks which
are not samples of an objective may provide an adequate basis for generali-
zation to that objective. Fremer notes that given a nearly infinite popula-
tion of objectives, the use of a survey instrument as a basis for making
criterion-referenced inferences would allow increased efficiency.

An example is offered of the use of a survey reading test to make
inferences about ability to read a newspaper editorial. A CRT of ability to
read editorials might consist of items quite different from the behavior of
interest. Fremer offers an illustrative example of using vocabulary test
scores to define objective-referenced statements of ability to read edito-
rials. Fremer notes, however, that the usefulness of interpretive
tables, i.e., those that provide statements referencing criterion behaviors
to a range of test scores, depends heavily on the method used to establish
the relationship between the survey test scores and the objective-referenced
ability. An essential aspect would be the use of a large and broad enough
sample of criterion performance to permit generalization to the broader
range of performances. Fremer's example provides for the definition of
several levels of mastery and points out that an absolute dichotomy, mastery
versus non-mastery, will seldom be meaningful. It is difficult to under-
stand why Fremer makes this statement, as the basic use of CRT is to decide
whether an individual possesses sufficient ability to be released into the
field or requires further instruction. Many levels of performance can be
identified, but are ultimately reduced to pass-fail, Mastery/Non-Mastery.
Fremer apparently bases his objection on measurement error which can render
classification uncertain. However, as discussed earlier, proper choice of
cut-off and careful attention to development should minimize classifica-
tion errors. Fremer proposes that the notion of minimal competency should
encompass a variety of behaviors of varying importance--the metric of
"importance" will vary with the goals of the educational system.

Fremer (1972) proposes a method for relating survey test performance
to a minimal competency standard that would involve a review of the propor-
tion of students at some point in the curriculum who are rated as failures.
This should serve as a rough estimate of the proportion of students failing
to achieve minimal competency. It would then be possible to apply this
proportion to the score distribution for the appropriate test in a survey
achievement test, clearly a normative approach. A second approach to
referencing survey achievement tests to a criterion of minimal competency
would be to acquire instructor judgement as to the extent to which individual
items could be answered by students performing at a minimal level. By
summing across items, it would be possible to obtain an estimate of the
expected minimum score. Fremer, however, recognizes the limitations of
this latter process with its high reliance on informed judgement. A further
method proposed by Fremer seeks to define minimal competency in terms of
student behaviors. The outcome of this method would be the identification
of bands of test scores that would be associated with minimal competency.
The processes involved in this method also rely on informed judgment, though.

Another method proposed by Fremer to criterion-reference survey
achievement tests involves developing new tests with a very narrow focus,
i.e., a smaller area of content and a restricted range of difficulty. It
should not be necessary to address every possible objective. However, it
should be possible to develop a test composed of critical items by sampling
from the pool of items. The next step in the process would involve relating
achievement at various curriculum placements between the focused test and
the survey instrument. This should allow keying of the items on the survey
test with specific critical objectives.

Still another method put forth by Fremer to get from criterion-
referenced to survey tests is the stand-alone work sample test. This
technique is intended for use when there is an objective that is of such
interest that it should be measured directly. The procedures that Fremer
puts forth are very clever in concept and are mainly applicable to school
systems and traditional curricula where well-developed survey instruments
exist. Even so, considerable work is involved in keying the survey instru-
ment. In non-school system instructional environments, dealing with non-
traditional curricula, it is unlikely that an appropriate survey instrument
would exist. 
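
The first two of Fremer's approaches described in this section reduce to
simple arithmetic, sketched below. The function names, the survey scores,
the 20 percent failure rate, and the instructor judgements are all
hypothetical; ties in the score distribution are ignored for simplicity.

```python
def cutoff_from_failure_rate(scores, failure_rate):
    """Normative approach: apply the observed failure proportion to the
    survey score distribution and return the lowest passing score."""
    ranked = sorted(scores)
    n_fail = int(failure_rate * len(ranked) + 0.5)  # nearest whole examinee
    if n_fail >= len(ranked):
        raise ValueError("failure rate would fail every examinee")
    return ranked[n_fail]


def expected_minimum_score(item_judgements):
    """Judgemental approach: sum, across items, the instructor's estimate of
    the chance that a minimally competent student answers the item."""
    return sum(item_judgements)


# Hypothetical survey test scores for ten students.
survey_scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]

# Suppose 20 percent of students at this point are rated as failures.
cutoff = cutoff_from_failure_rate(survey_scores, 0.20)

# Hypothetical instructor judgements for a four-item test.
expected_min = expected_minimum_score([0.9, 0.8, 0.5, 0.3])

print(cutoff, expected_min)
```

Both results still rest on the judgements that feed them, which is the
limitation Fremer himself recognizes.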



USING NRT TO DERIVE CRT DATA 

Cox and Sterrett (1970) propose an interesting method for using NRTs
to provide CRT information. The first step in this procedure is to specify
curriculum objectives and to define pupil achievement with reference to
these objectives. The second step would involve coding each standardized
test item with reference to curriculum objectives. With coded test items
and knowledge of the position of each pupil in the curriculum, it is possi-
ble to determine the item's validity in the sense that pupils should be able
to correctly answer items that are coded to objectives that have already been
covered. Step three is the scoring of the test independently for each pupil,
taking into account his position in the curriculum. The authors suggest
that this model is particularly applicable to group instruction, since place-
ment in the curriculum can generally be regarded as uniform. Therefore, it
is possible to assign each pupil a score on items whose objectives he has
covered. It is also possible to obtain information on objectives which were
excluded or not yet covered. This method seems an economical way to extract
CRT information and NRT information from the same instrument. The technique
has yet to be explored in practice, however.
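
The three steps can be sketched as follows. The item codes, objective
names, and pupil responses are hypothetical; the sketch only shows how a
single set of item responses might be split into a score on covered
objectives and a score on objectives not yet covered.

```python
def score_by_curriculum_position(responses, item_objective, covered_objectives):
    """Score one pupil separately on items whose objectives he has covered
    and on items whose objectives were excluded or not yet covered."""
    covered = {"right": 0, "total": 0}
    not_covered = {"right": 0, "total": 0}
    for item, is_correct in responses.items():
        if item_objective[item] in covered_objectives:
            bucket = covered
        else:
            bucket = not_covered
        bucket["total"] += 1
        bucket["right"] += int(is_correct)
    return covered, not_covered


# Step 2: each standardized test item is coded to a curriculum objective.
item_objective = {"q1": "addition", "q2": "addition",
                  "q3": "fractions", "q4": "fractions"}

# Step 3: score the pupil, taking his position in the curriculum into account.
responses = {"q1": True, "q2": True, "q3": False, "q4": True}
covered, not_covered = score_by_curriculum_position(
    responses, item_objective, covered_objectives={"addition"}
)

print(covered, not_covered)
```

The total of the two "right" counts is still the ordinary norm-referenced
raw score, so both kinds of information come from the same administration.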

CONSIDERATIONS FOR A CRT IMPLEMENTATION MODEL

The development and use of CRT is a fairly recent development in instruc-
tional technology. Partially as a result of this, there is no comprehensive
theory of CRT such as exists for NRT. Hence, the concepts of validity and
reliability for CRT are not yet well developed, although definition of those
concepts is necessary to reduce errors of classification. The need for
content validity in CRT is, however, well recognized. In addition, there is
no single CRT construction methodology which will serve for all content
domains. Unresolved questions also revolve around the question of
bandwidth fidelity and the use of reduced fidelity in criterion-referenced
performance tests.

The rationale for the use of CRT in evaluating training programs and
describing individual performance is well established. To ensure best
possible results, the military or industrial user should exert every effort
to maintain stringent quality control, including:

1. Careful task analysis:
a. Observation of actual job performance when possible
b. Identification of all skills and knowledge that must be trained
c. Careful identification of job conditions
d. Careful identification of job standards
e. Identification of critical tasks.

2. Careful formulation of objectives:
a. Particular care in the setting of standards
b. Identification of all enabling objectives
c. Independent check on the content of the objectives
d. Special attention to critical tasks.

3. Item development:
a. Determine if all objectives must be tested
b. Survey of resources for test
c. Determination of item form
d. Statement of rules for items
e. Development of item pool for objectives to be tested
f. Develop tryout plan and criteria for item acceptance
g. Tryout of items
h. Revision and rejection of items.



Particular care must be exercised in setting item acceptance criteria
for item tryout. The use of typical NRT item statistics should be minimized.
The usual methods are inadequate for this purpose: internal consistency
estimates are suitable only with large numbers of items, and internal
consistency may not be an important consideration in any case. Traditional
stability indexes may also be inappropriate, again because of small numbers of
items and reduced variance. The technique proposed by Edmonston et al. (1972)
may prove effective in reducing errors of misclassification due to inadequate
test items.

By adhering to strict quality control measures, it should be possible to
obtain a set of measures that have a strong connection with a specified content
domain. Whether they are sensitive to instruction, or whether they will vary
greatly due to measurement error, is unknown. Careful tryout and field
follow-up may currently be the best controls over errors of misclassification
due to poor measurement. The ethical question of the use of measures with
unknown psychometric properties in making decisions about individuals remains
to be addressed.



COST-BENEFITS CONSIDERATION 

Although the costs of training and the costs of test administration can
readily be quantified in dollar terms, we lack a proper metric to completely
assess the costs of misclassification. Emrick (1971) proposes a ratio of regret
to quantify relative decision error costs. Emrick's metric, however, appears
rather arbitrary and in need of further elaboration. The probability of
misclassification is the criterion against which an evaluation technique must
be weighed. The results of misclassification range from system-related effects
to interpersonal problems. In instances where misclassification results in a
system failure, cost can be accurately measured and is likely to be high.

A relative index of cost can be gained from the task analysis. If the
analysis of the job reveals a large number of critical tasks, or individual
tasks whose criticality is great, then the cost of supplying a non-master can
be assessed as high, and great effort is justified in developing a training
program featuring high-fidelity, costly CRT. Where the analysis does not
reveal high numbers of critical tasks, the cost becomes a function of less
quantifiable aspects. Misclassification also results in job dissatisfaction
and morale problems, evidenced by various symptoms of organizational illness,
e.g., absenteeism, high turnover, and poor work group cohesion.

A possible solution to the cost-benefit dilemma may come from work with
symbolic performance tests and the work cited earlier showing that job
knowledge tests can sometimes suffice. The use of symbolic tests and/or job
knowledge tests would result in greatly reduced testing costs in many
instances. The decision as to the appropriateness of the test must be made
empirically on the basis of well-controlled tryout with typical course
entrants. The development of symbolic performance tests may prove to be
difficult. Much is yet to be known about how to approach this development. If
progress can be made in lowering the cost of CRT, then the problem of
cost-benefit analysis will be largely obviated.

As the question currently stands, there is no doubt that CRT provides
a good basis for evaluation of training and the determination of what a
trainee can actually do. If the system in which the trainee must function
includes a number of critical functions which will render misclassification
expensive, then CRT is a must.



PART 2: SURVEY OF CRITERION-REFERENCED TESTING IN THE ARMY



PURPOSE AND METHOD OF THE SURVEY 

In order to survey the application of criterion-referenced testing
techniques in the military, a number of Army installations were visited.
Information was collected to supplement the literature search and review,
to provide detailed material on CRT development and use in the Army, and
to obtain information on attitudes and opinions of Army testing personnel.

Specifically, the survey gathered data on:

1. How CRTs are developed for Army applications. In order to
create a CRT construction manual which will be useful to Army test
developers, it is necessary to determine how CRTs are currently developed in
the Army. Additionally, it is important to determine differences in test
development strategies across Army installations, so that the manual can
suggest procedures which will mate well with a variety of approaches.

2. How CRTs are administered in various Army contexts. This
information is important since design for administration materially affects
the test construction process. Design information is important in creating
guidelines on development of CRTs, in order to make them suitable for
administration in diverse Army testing situations.

3. How CRT results are used in the Army. The way in which a test's
results are used is a factor that must be considered in the development of
any test. Hence, the survey obtained data on use of test results in a
variety of Army testing situations.

4. Extent of criterion-referenced testing in the Army. This includes
information on extensity (how prevalent criterion-referenced testing is in
the large, Army-wide sense) and information on intensity (how much testing
in specific Army contexts is of a criterion-referenced type).

5. The level of personnel who will use the CRT Construction Manual
developed by the project. This information includes educational levels, range
of military experience, and familiarity with psychometric concepts. Such
information is designed to help tailor the manual to its audience.

6. Problems encountered by Army testing personnel in the development
and use of criterion-referenced tests. Information on problems serves
two purposes. First, the identification of typical problem areas points
the way toward future research on criterion-referenced testing. Second, the
CRT Construction Manual can deal with typical problems, offering suggestions
for avoiding or surmounting them.

7. Attitudes of Army testing personnel toward the development and
use of CRTs. It is important to assess existing attitudes toward CRTs
among Army testing personnel, since level of acceptance is an indicator of
the spread and utility of a new concept. Additionally, attitudinal data will
enable the CRT Construction Manual to address current attitudes, and thus to
attempt to rectify poor attitudes based upon misconceptions.

8. The probable future course of criterion-referenced testing in
the Army. Interview data, particularly those collected from personnel at
supervisory levels, indicate probable trends in future Army CRT use. Also,
problems in implementing CRT applications suggest needed research.

9. Sample Army CRTs and problems in developing and using them. An
important part of the on-site survey is to gather materials to serve as the
basis for examples of CRT development and use.

Interview Protocol Development. In order to gather these types of
information, an interview protocol for on-site use at various Army posts was
developed. Development of the protocol included several review phases during
which revised versions of the protocol were prepared. The second version of
the protocol consisted of three forms: one to be used in interviews with test
constructors, another for test users, and a third to be used with supervisory
personnel. The final instrument combined these forms and included several
optional items for use in interviews with personnel who were especially
knowledgeable about criterion-referenced testing. The final version of the
protocol was found to have high utility, since it can be used to structure
interviews with personnel who serve any of three functions (test construction,
test use, and supervision). The protocol provides flexibility in the range
of topics to be discussed in an interview, thereby allowing interviews to be
tailored to the ranges of responsibilities, experience, and knowledge
possessed by individual interviewees. Appendix A of this report is a copy
of the final version of the protocol.

The interview protocol was used in a series of one-to-one interviews
conducted during January, February, and March 1974. Installations surveyed
during this period included the Infantry School at Fort Benning, the Artillery
School at Fort Sill, the Air Defense School at Fort Bliss, the Armor School
at Fort Knox, and BCT and AIT units at Fort Ord. In addition, test-related
departments were surveyed at each post. A total of 105 individuals were
interviewed.

Survey Teams. A survey team spent three days at each post surveyed. The
interviews ranged in duration from approximately one-half to three hours
apiece and averaged about one and one-half hours. Interview length was at
the interviewer's discretion, based on the utility of the information obtained
from a subject.

Summaries of the types of personnel surveyed at each installation,
presented in a following section of this report, indicate each interviewee's
position in the organization for Army School, MOS, TEC, and Training Center
testing programs, and whether the individual is a test developer/user (test
administrator, test scorer, etc.) or a supervisor of test construction or use.

Each interviewee responded to most of the items on the protocol.
Responses which are easily and meaningfully quantifiable are presented in
tables in the following section. Other items elicited opinions, anecdotal
information, process information, and other data that are not easily
quantifiable. Such data are summarized by extracting and comparing verbal
descriptions and are also discussed in the next section of this report.



RESULTS AND DISCUSSION 

Sample. Table 1 presents a summary of the individuals interviewed at
Forts Benning, Bliss, Sill, Knox, and Ord. Of 105 individuals interviewed,
more than half were personally involved in constructing, administering,
scoring, or making decisions based on test scores. The remaining individuals
surveyed were supervisors of personnel who constructed or used tests. [1]

Table 1 also identifies four categories of subjects: School personnel
(Infantry, Artillery, Air Defense, and Armor), Military Occupational
Specialty (MOS) Test personnel (groups involved with the development and
administration of annual MOS tests), Training Center personnel (BCT and AIT),
and TEC (Training Extension Course) Program personnel. No Training Center
data were collected at Fort Benning, while Fort Ord data dealt exclusively
with Training Center programs.

[1] Also included in the survey was a visit to the U.S. Army Southeastern
Signal School (USASESS) and the U.S. Army Military Police School, Fort Gordon,
Georgia. Contractual time constraints did not allow the application of the
formal survey protocol. Following is a summary of the findings at Fort Gordon.

Test Quality Control at USASESS is conducted on both an internal and an
external basis. Internal control entails examining tests constructed by the
academic departments for consistency using Evaluation Planning Information
Sheets (EPISs). These documents are, in turn, examined for consistency with
the Training Analysis Information Sheets from which they are derived.
Examinations are supplemented by direct observation of ongoing tests, to
ensure that requirements in the test administrator's manual are being met
(i.e., that appropriate tasks, conditions, and standards are being employed).

External quality control is maintained through the use of questionnaires
which ask field unit respondents to indicate the actual job value of tasks on
which they were trained in school. The questionnaires are followed up by
direct interviews with school graduates in the field. Additional quality
control information is obtained via communication with field unit commanders.

Among the problems noted were: (1) concern over the lack of adequate criteria
in training the "soft skills," such as counseling and leadership; and (2)
ambiguity in existing regulations, which are open to varying interpretations
by different schools and by individuals within schools.

A total of 67 individuals were interviewed in School organizations.
This focus on school personnel is appropriate since the CRT Construction
Manual will be used primarily in the schools. It is interesting to note
that of the 79 subjects who were asked if special training were available
for testing personnel, almost 80% responded yes. This does not mean that
80% of the subjects asked had received such training, but that training in
testing techniques is available in the Army. Many individuals who
participated in the survey were experienced in constructing or administering
tests, and several had received special training in testing. For a more
detailed analysis of the subjects and their organizational positions, see
Appendix B.

Tables 2 through 7 present summaries of responses to quantifiable
protocol items. The data upon which these summaries are based are in the
appendices. Note that since interviews were tailored to address the knowledge
and experience of the individual, not all subjects were asked all items.
For example, if it was established that an individual was not involved in
test development but in test administration or in use of test results, that
individual was not queried concerning test construction. Hence, in Table 2,
for example, a maximum of 87 individuals responded to a given item.



Test Development. Table 2 summarizes responses to protocol items
concerning involvement with various steps of CRT development. Details of
Army test construction processes vary widely; however, some impressions of
the test construction process can be gained from Table 2.

The data presented in Table 2 are subject to interpretation. For
example, although slightly over half of the subjects asked answered "yes" to
the protocol item about using an item analysis technique (item 9b), further
questioning during the interview usually revealed that they were not using a
formal item analysis technique. Instead, they typically inspect a computer
printout of percent right and wrong responses to items on a test. Items
having an unusually high number of wrong responses are reworked or discarded.
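
The informal screening described above amounts to a simple tabulation, which
the following minimal sketch illustrates. The 50% cutoff, the function names,
and the sample responses are hypothetical; the interviewees did not report a
fixed threshold, and the sketch is not a formal item analysis.

```python
def percent_wrong(response_matrix):
    """Return the percent of trainees answering each item incorrectly.

    response_matrix: one list per trainee, one boolean per item
    (True = correct). All trainees are assumed to have taken every item.
    """
    n_trainees = len(response_matrix)
    n_items = len(response_matrix[0])
    return [
        100.0 * sum(1 for row in response_matrix if not row[i]) / n_trainees
        for i in range(n_items)
    ]


def flag_items(response_matrix, threshold=50.0):
    """Return indexes of items whose percent wrong exceeds the threshold.

    The threshold is an illustrative assumption, not an Army standard.
    """
    return [
        i for i, pct in enumerate(percent_wrong(response_matrix)) if pct > threshold
    ]


# Hypothetical results for four trainees on a three-item test.
responses = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [False, True, True],
]
wrong = percent_wrong(responses)
flagged = flag_items(responses)
```

In this hypothetical data the third item is missed by three of four trainees
and would be singled out for rework or rejection. Such a tabulation says
nothing about whether the item discriminates masters from non-masters.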

After the final test items are selected, Army test developers usually do
not assess reliability and validity, at least in a strict psychometric sense.
Instead, the tests are administered several times, and items that cause a
great deal of difficulty are reviewed to see if they are constructed
properly, a relatively informal process.



Table 1

SURVEY OF CRITERION-REFERENCED TESTING IN THE ARMY:
SUBJECTS INTERVIEWED AT FORT BENNING, FORT BLISS,
FORT SILL, FORT KNOX, AND FORT ORD
(N = 105)

Total Number of Supervisors (S) Interviewed: 44

Total Number of Test Developers/Users (TDU) Interviewed: 61

Table 2

INVOLVEMENT IN VARIOUS STEPS OF TEST DEVELOPMENT:
SUMMARY OF RESPONSES ACROSS ALL POSTS

                                                          Number of   Percent
Item                                                      Subjects    of "Yes"
No.     Brief Statement of Item*                          Responding  Responses

4       Have you been included in writing objectives?        --          78
4b      Do you write objectives in operational,
        behavioral terms?                                    --          71
5       Have you participated in setting standards?          69          77
6       Have you participated in imposing practical
        constraints?                                         72          68
7       Have you helped determine priorities?                70          67
8       Have you been included in writing test items?        68          70
8b, 9   Do you write item pools? / Have you been
        involved in selecting final test items?
        (rows merged in the source copy)                     50          66
9b      Do you use an item analysis technique?               --          --
11      Do you measure test reliability?                     --          55
11b     Do you compute coefficients of reliability?          42          --
12      Do you aid in validating tests?                      --          --
12b     Do you use content validity?                         41          --

* For complete wording of the protocol items, see Appendix A.
-- Value not legible in the source copy.



It appears that relative care is taken in Army test development
programs to select and define objectives and their associated conditions
and standards. Some care is taken in writing items to match these
objectives. From this point on, however, empirical rigor is lacking; that
is, formal item analysis and assessment of test reliability and validity
are infrequently done.

Test Administration. Table 3 presents subject responses to protocol
items dealing with test administration. A large proportion of subjects in
the survey have been involved in administering tests. This is not surprising
since much test development is done by school instructors; thus, individuals
who create test items also administer the tests in their classes. These
are heartening data: it is advantageous for test developers to be familiar
with test administration situations, since it gives them increased
familiarity with the conditions and limitations inherent in such situations.



Table 3

INVOLVEMENT IN ASPECTS OF TEST ADMINISTRATION:
SUMMARY OF RESPONSES ACROSS ALL POSTS

                                                          Number of   Percent
Item                                                      Subjects    of "Yes"
No.     Brief Statement of Item*                          Responding  Responses

10      Have you participated in administering tests?        --          --
10b     Do you ever use the "assist method"?                 --          --
13      Do you use "go-no go" scoring standards?            100          --
14b     Do you retest trainees who fail the first time?      --          --

* For complete wording of the protocol items, see Appendix A.
-- Value not legible in the source copy.

Table 3 also shows that an "assist" method of scoring is frequently
used. It appears that test administrators often find it appropriate to
provide help to individuals taking the test. The actual percentage of
test administrators using a true assist method is probably somewhat lower
than that shown in Table 3, since a good number of those who stated that
they use this method indicated that they provide help only if testees have
difficulty with ambiguities in test language or instructions. In a true
assist method, help is given to those individuals who cannot perform a
particular item for whatever reason. Such a method is often used in cases
where the testee could not otherwise complete the test (e.g., a checkout
procedure).

Less than half of the 100 subjects queried said that they use go-no go
scoring standards on their tests. This does not imply that more than half of
the individuals in our survey necessarily use normative scoring standards;
instead, many use point scales for scoring.
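
The difference between the two scoring approaches can be made concrete with a
minimal sketch. The step names, point weights, and function names below are
hypothetical and are not drawn from any Army test; the sketch assumes that a
go-no go standard requires every step to meet its standard.

```python
def go_no_go(step_results):
    """Return "GO" only if every required step met its standard."""
    return "GO" if all(step_results.values()) else "NO GO"


def point_score(step_results, weights):
    """Return (points earned, points possible) on a weighted point scale."""
    earned = sum(weights[step] for step, passed in step_results.items() if passed)
    return earned, sum(weights.values())


# Hypothetical performance test with three steps; the trainee fails one step.
steps = {"clear weapon": True, "disassemble": True, "reassemble": False}
weights = {"clear weapon": 40, "disassemble": 30, "reassemble": 30}

decision = go_no_go(steps)
points = point_score(steps, weights)
```

Under the point scale this trainee earns 70 of 100 points, which could pass
under a cutoff of 70, while the go-no go standard classifies the same
performance as a failure because one step was not met.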

Over 70% responded that trainees who fail a test the first time are
retested. There are many cases where retesting is done. For example, in BCT,
AIT, and other hands-on performance testing situations, trainees are often
given second and third chances to pass particular performance items.

Uses of Test Results. The primary use of test results is, of course,
to evaluate individual performance. This is true whether the test is
criterion-referenced or normatively based. There are, however, other ways
in which test results can be used. Table 4 presents a summary of responses
to protocol items dealing with various uses of test results. Table 4 shows
that the most common uses of test results, other than for evaluation of
trainee performance, are for improving training and for diagnosis. Test
results can diagnose areas in which an individual is weak and in need of
remediation. Seventy-two percent of the subjects questioned indicated that
they use test results for diagnostic purposes. Diagnosis is usually done
informally: instructors review test results and then confer with trainees.

Test results can also be used to assess course adequacy in the formative
evaluation sense. Seventy-three percent of the subjects questioned indicated
that they use feedback from the tests to improve courses. The way in which
this feedback is used varies widely. For example, some senior instructors
indicated that if many trainees from a particular instructor's class perform
poorly on certain parts of a test, they would first evaluate the instructor.
If several classes taught by different instructors scored poorly on a section
of a test, the senior instructor might review the materials used in that
portion of the course. In other situations, the test itself is reviewed
using feedback from the students. For example, if a test item is unclearly
worded or if the performance called for is unclear, student feedback is a
valuable tool for identifying the problem.



Table 4

USE OF TEST RESULTS OTHER THAN EVALUATING INDIVIDUAL PERFORMANCE:
SUMMARY OF RESPONSES ACROSS ALL POSTS

                                                          Number of   Percent
Item                                                      Subjects    of "Yes"
No.     Brief Statement of Item*                          Responding  Responses

14      Do you use test results to compare trainees?         91          65
15      Do you use test feedback to improve courses?         --          73
16      Do you use test results for diagnostic
        purposes?                                            95          72
--      Are you familiar with team performance
        testing?                                             --          42

* For complete wording of the protocol items, see Appendix A.
-- Value not legible in the source copy.



Less than two-thirds of the subjects questioned indicated that test
results are used to compare trainees. Comparing individuals on the basis of
test results is essentially norm-referenced. It is possible, however, to
employ CRTs for norm-referenced purposes. In BCT, for instance, trainees
who pass the comprehensive performance test on their first try might be
considered for promotion from E1 to E2, while those who do not may not be
so considered.

Considerably less than half of the subjects questioned said that they
were familiar with team performance testing situations. Further, of those
who indicated familiarity with the concept, many indicated that team
performance testing is often individual evaluation in a team context.
Actually, the testing of team performance was very limited on the Army posts
visited.

Types of Tests. Table 5 shows a description of types of tests constructed
or used by subjects in our survey sample, based upon their responses to
protocol item 27. Part 1 is a categorization according to test mode, Part 2
according to test use. For both parts, subjects were asked to indicate the
approximate percentage of each type of test with which they were involved.

Table 5

TYPES OF TESTS CONSTRUCTED OR USED:
SUMMARY OF RESPONSES TO PROTOCOL ITEM 27
ACROSS ALL POSTS

Item 27 - Part 1
N = 95

What proportion of the tests you have participated in making or using are:

                                                          Mean Response
A. Paper-and-pencil knowledge tests?                           --
B. Simulated performance tests? (e.g., using
   mockups and drawings)                                       --
C. "Hands-on" performance tests?                               --
D. Other?                                                      --
                                                   Total:     100%

Item 27 - Part 2

What proportion of the tests you have participated in making or using are
for:

                                                          Mean Response
A. Specific skill and knowledge requirements?                  --
B. Specialty areas in a course?                                --
C. End of block within a course?                               --
D. Mid cycle within a course?                                  --
E. End of course?                                              --
                                                   Total:     100%

-- Value not legible in the source copy.

It appears that most tests are either paper-and-pencil knowledge
tests or hands-on performance tests. Although Table 5 indicates that
paper-and-pencil knowledge tests are nearly 50% of those created and
used, many subjects confused paper-and-pencil knowledge tests with
paper-and-pencil performance tests. This was learned from discussions with
interviewees. In many areas, paper-and-pencil tests are equivalent to
the performance called for in the actual task situation. For example,
such diverse areas as map-making and aiming artillery require
paper-and-pencil performance. Maps must be drawn to scale, while in many
cases the aiming of artillery requires mathematical computations. It is
estimated that about half of the responses in the paper-and-pencil knowledge
test category actually referred to paper-and-pencil performance testing. Thus,
responses to Part 1 of Item 27 can be interpreted to indicate that nearly
three-quarters of the tests constructed or used are performance tests of
one sort or another. These results accord with the emphasis on performance
testing, and indicate that performance testing has become widespread
in many phases of Army evaluation.

Responses to Part 2 indicate that tests measuring specific skill and
knowledge requirements, and those used at ends of blocks of instruction,
account for about 70% of test construction and use. Mid-cycle tests and
end-of-course tests together account for less than one-quarter of the
tests. Responses to Part 2 of Item 27 indicate that tests are well
distributed throughout instruction. This is good news, since frequent
testing can provide frequent feedback and the possibility of ongoing
remediation.

Problems. Table 6 presents a summary of responses to protocol items
dealing with problems in the development and use of CRTs. Over two-thirds
of the subjects (who were primarily supervisory personnel for this item)
indicated that increased expense may be a problem in the development and
use of CRTs. Several subjects commented that the extra expense may be a
factor in reducing the availability of CRTs in the Army. However, many
individuals indicated that increased expense is a short-term factor, and
that in the long run, criterion-referenced testing is less expensive than
norm-referenced testing. Criterion-referenced testing is presumably
less costly in terms of ensuring the efficient output of well-trained
soldiers.

Many individuals in the survey sample felt that time pressures, or
other constraints, often prevent successful construction and use of tests.
In discussion, subjects indicated that time pressure is the most common
constraint, and that time pressures are usually present in test development.



Table 6

GENERAL PROBLEMS IN THE DEVELOPMENT AND USE
OF CRITERION-REFERENCED TESTS:
SUMMARY OF RESPONSES ACROSS ALL POSTS

                                                          Number of   Percent
Item                                                      Subjects    of "Yes"
No.     Brief Statement of Item*                          Responding  Responses

30      Have time pressures, or other constraints,
        prevented successful test construction
        and use?                                             89          61
31      Have you seen tests which were unsuitable
        for their intended uses?                             --          57
35      Are Criterion-Referenced Tests more expensive
        to develop and use than norm-referenced
        tests?                                               --          71

* For complete wording of the protocol items, see Appendix A.
-- Value not legible in the source copy.



However, time pressures and other constraints do not usually interfere with
test administration tasks. Usually, tests are administered satisfactorily
despite time pressures. Interviewees seemed to think that Army test
development and administration have improved greatly in recent years.

Attitudes. Table 7 presents a summary of subject attitudes concerning
criterion-referenced testing in the Army. In general, subjects were in
favor of the Army trend toward criterion-referenced testing. Comments
included: "Criterion-referenced testing is the best system of testing yet
devised"; "It is the only way to go"; "It is a terrific improvement over
testing in the old Army"; "Criterion-referenced testing should be used
exclusively in the Army and wherever else possible, including civilian
educational institutions." Eighty-eight percent of the individuals
responding felt that criterion-referenced testing should receive high or
top priority in terms of Army assessment programs. Sixty percent felt that
criterion-referenced tests should replace most or all norm-referenced tests.

Subjects felt that criterion-referenced testing is practical and useful
in measuring job performance skills. No other item on the survey protocol
elicited a 100% positive response. In addition, many individuals felt that
criterion-referenced testing would be useful and practical for measuring
areas other than job performance skills. Knowledge tests, for example,
were seen by many as a practical and useful application of the
criterion-referenced concept.

Table 7

ATTITUDES CONCERNING CRITERION-REFERENCED TESTING:
SUMMARY OF RESPONSES TO PROTOCOL
ITEMS 39 AND 40 ACROSS ALL POSTS

Item 39

How strongly do you feel about future use of Criterion-Referenced Testing
in the Army? Should Criterion-Referenced Test development receive high
or low priority in terms of Army assessment programs?

N = 80

Percent Responding
to Each Alternative

  1    Strongly against: Criterion-Referenced Testing should
       receive bottom priority, or be dropped entirely.

  1    Against: Criterion-Referenced Testing should receive
       low priority.

 10    Neutral: Criterion-Referenced Testing should receive
       average priority.

 28    For: Criterion-Referenced Testing should receive high
       priority.

 60    Strongly for: Criterion-Referenced Testing should
       receive top priority. Criterion-Referenced Tests
       should replace most or all norm-referenced tests.

Total: 100%

Item 40

Do you feel that Criterion-Referenced Testing is practical and useful in
measuring job performance skills?

Number of Interviewees Responding = 84
Percent Responding "Yes" = 100



DISCUSSION OF CRT SURVEY 

Over 150 hours of interviews were conducted during the survey of
criterion-referenced testing in the Army. Topics covered ranged from the
extent, utility, and practicality of CRT use in the Army, to problems in
implementing CRTs.

Although criterion-referenced testing is used in today's Army, many
NRTs are in use also. This is not surprising, since criterion-referenced
testing is a relatively new concept. It was apparent from the survey,
however, that CRT use is increasing.

At each installation visited, criterion-referenced testing was in
evidence. The combat arms schools visited (Infantry, Armor, Artillery, and
Air Defense) develop and use a number of CRTs. However, school
implementation of criterion-referenced testing is in the beginning stages.
Some departments are making serious attempts to incorporate CRTs, while others
are only minimally involved. Many employ criterion-referenced terminology,
but do not produce true CRTs. This is especially true in "soft skill"
areas, such as tactics and leadership. Most academic departments within
these four combat arms schools indicated that many of their tests, especially
the written ones, are graded on a curve. Much reliance appears to be placed
upon subjectively graded paper-and-pencil tests and upon computer-graded
objective tests.

MOS testing continues to be primarily norm-referenced. Most, if not all,
MOS tests rely on situational multiple-choice items. Because of the low
fidelity of such items, it is often difficult to determine if they are
criterion- or norm-referenced. On the surface, at least, they are
suspiciously similar to conventional knowledge test questions.

Consideration of the CRT concept is being given to Training Extension 
Course packages. The optional "audio-only" performance test appended to 
such TEC packages requires further development and implementation so that 
TEC instruction can be more thoroughly evaluated in a criterion-referenced 
fashion.

At Fort Ord, California, CRTs are employed both in BCT and in AIT. 
Although there are problems in the administration of the Comprehensive
Performance Tests (a type of CRT used toward the end of basic training), the
testing experiences at Fort Ord should be able to serve as a good "field 
laboratory" for developing CRT applications. 

AIT in diverse areas such as field wiring and food services appears to
be benefiting from the use of CRTs. Preliminary indications are that more
soldiers are being evaluated more effectively through the application of 
criterion-referenced testing. Further, instructors, supervisors and 
students all appear to be favorably disposed toward CRTs. 

In general, although criterion-referenced testing is not extensive, there
are many instances of serious attempts at CRT development and use at the
Army installations visited.² There was much respect for the utility and
practicality of criterion-referenced testing. As noted, many interviewees
were strongly in favor of increased use of criterion-referenced testing in
the Army. Many who had experience with developing or using such tests
indicated increased evaluation effectiveness, increased individual morale
and, in the long run, reduced expense as a function of CRTs. Despite this
high regard, there was too little rigorous development or application of
CRTs. While progress is being made toward achieving rigor in "hard skill"
areas, especially in equipment-related skills, attempts in "soft skill" 
areas are lacking. Personnel who develop tests for such areas in many cases 
are attempting to develop CRTs, but are diverted at the outset since genuine 
difficulties in specifying objectives explicitly are often encountered. 

The survey revealed virtually no evidence of criterion-referenced testing
in team performance situations. In fact, as many subjects pointed out, 
operational units are not formed until after AIT. This does not mean, 
however, that CRT development for unit performance is inappropriate. Such 
tests could be developed and used in AIT and then exported to field units. 
Although problems may occur when an individual begins to work within a field 
unit, this is not an argument against unit CRTs. 

The CRT Construction Manual. Subjects at all levels indicated a need
for increased development and use of criterion-referenced testing in the Army.
Many indicated the need for guidance in constructing and in administering
CRTs. A consensus indicated that such guidance should be written in simple,
straightforward language and should address criterion-referenced testing in
a non-theoretical, practical manner. Individuals interviewed in the survey
indicated that a manual of this type would be well received at all levels 
in test development and evaluation units. 



² Many of the personnel interviewed confused CRTs with "hands-on" performance testing. In terms of
implementing hands-on performance testing programs, the trend at the Army posts visited is dramatic;
many such tests are in evidence. Not all of these tests are criterion-referenced, however; many are not.
For a test to be called criterion-referenced, an individual testee's skills or knowledge must be compared to
some external standard. This means that test items must be matched to objectives which are derived from
valid performance data. This is not the case for a significant proportion of the "hands-on" performance
tests presently used at the sites surveyed.




Test Development Process. A number of difficulties in CRT development
and use were observed and/or described during the survey. First, the
development of CRTs must be derived from well-specified objectives which
are, in turn, the results of careful task analyses. Unfortunately, task
analysis data are not available in many cases, and in cases where they are
available, they are often disregarded. Many test developers write statements
of performance standards from Plans of Instruction (POIs) or from
Subject Schedules. In most cases, these POIs and schedules are based
upon task analyses. However, often the critical source data are not
readily apparent. In other cases, objectives are defined "out of the blue"
by subject matter experts who may be unfamiliar with the instructional
[illegible] task analyses
have been developed and then ignored. For example, in one AIT course
visited, a careful task analysis had been conducted which accurately
documented critical behaviors. Although the performance tests used in the course
were developed from objectives derived from the task analysis, the recency
[illegible] contradicted, the
task analytic data. As a result, the revised subject schedule required
testing skills that the task analysis had revealed are performed very
infrequently, but did not mention other skills which, according to the task
analysis, were most frequently performed.

Many difficulties in CRT development can be overcome if task-analytic
[illegible] development of tests. When tests are modified,
[illegible] responsible for the modification should have
access to the same task-analysis data.

Practical Constraints. The CRT survey suggested that priorities and
practical constraints for task objectives are usually assessed informally,
[illegible] accurately assessed and defined, the development
[illegible] achievement of objectives is exceedingly
difficult. If all objectives are taken to be of equal weight, then they will
[illegible] in fact, more
important objectives may require more thorough testing.

[illegible] practical constraints to the testing situation are considered
[illegible]. Constraints which operate in the testing situation
should rightfully be considered while a test is being developed. Some
[illegible] (SMART) books, for example, show a minimal
[illegible] constraints. They contain lengthy checklists
[illegible] possibly of use in evaluating an individual's performance,
cannot be followed by test administrators. In some cases, one tester may
administer a SMART test to many soldiers simultaneously, although totally
unable to observe all items on the SMART checklist. Thus, at a given
testing station, a particular soldier may be scored as a "no go" while another
soldier may be scored "go" because the tester could only observe one
accurately. The problem of including practical testing constraints and
task priorities can be solved by training test developers to consider these
as an integral part of the test development process.



Item Pools. Test developers seem to have little difficulty creating
items if the performances, standards, and conditions are accurately
specified. However, many Army test developers surveyed indicated that they
wrote only the precise number of items required for a specific test. These
items are typically reviewed by subject matter experts and are then revised
accordingly. If alternate forms of a test are required, a pool of items
is constructed such that a computer can format alternate test forms by
selecting a subset of items from the pool. Rarely are extra items written.
Accordingly, there is no empirical selection process for final test items.
Items are typically dropped or revised, after a review, if large numbers
of individuals in a class answer them incorrectly.

Creating a test item pool should become a standard part of the test
development process. If twice as many items are developed as are needed
for a specific test, the test can be tried out and the final items selected
empirically. An empirical item analysis strategy should be incorporated to
select final test items. Although the creation of item pools and the use of
item analysis techniques may introduce added expense into the test development
procedure, the payoff should outweigh the expense. The payoff here is
the development of items that are feasible and which reliably address
appropriate criterion behaviors.
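
The tryout-and-select idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure taken from the survey: it assumes the tryout includes one group that has not yet received instruction and one that has, scores every item 1 (go) or 0 (no go), and ranks items by a simple sensitivity index (pass rate among trained soldiers minus pass rate among untrained soldiers). The function names and the sample data are hypothetical.

```python
from typing import List, Sequence


def sensitivity_index(untrained: Sequence[Sequence[int]],
                      trained: Sequence[Sequence[int]]) -> List[float]:
    """Per-item pass rate among trained minus pass rate among untrained.

    Each row is one soldier; each column is one item (1 = go, 0 = no go).
    """
    n_items = len(trained[0])
    index = []
    for i in range(n_items):
        p_trained = sum(row[i] for row in trained) / len(trained)
        p_untrained = sum(row[i] for row in untrained) / len(untrained)
        index.append(p_trained - p_untrained)
    return index


def select_items(index: Sequence[float], n_keep: int) -> List[int]:
    """Return the positions of the n_keep items with the highest index."""
    ranked = sorted(range(len(index)), key=lambda i: index[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical tryout: a pool of four items, of which two are needed.
untrained = [[0, 1, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
trained = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 0, 0]]

index = sensitivity_index(untrained, trained)
keep = select_items(index, n_keep=2)
print(keep)
```

An item that everyone passes before instruction, or that no one passes afterward, gets an index near zero and is a candidate for revision. The cutoff for dropping an item is a judgment call that the index alone does not settle.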

Reliability and Validity. A major omission in the development of CRTs,
as observed during the Army survey, is the lack of test evaluation. There
was virtually no consideration of test reliability and/or validity. This
does not indicate that the tests as developed are unreliable, but that the
question has not been addressed. A few subjects did indicate that content
validity had been considered by virtue of careful matching of test items
and task objectives. Content validity, however, is not necessarily the only
type of validity appropriate for CRTs. Predictive validity can also be 
assessed. That is, trainees can be tested using CRTs and then evaluated 
under field conditions performing the tasks for which they have been trained. 
Test results for a valid test should be congruent with later field perfor- 
mance results. 
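
One way to express the congruence between test results and later field performance as a number is sketched below. The choice of statistic is an assumption, not something the survey prescribes: both the CRT outcome and the field evaluation are treated as go/no-go, and the phi coefficient of the resulting 2 x 2 table is computed. The function name and data are hypothetical.

```python
import math
from typing import Sequence


def phi_coefficient(test_go: Sequence[int], field_go: Sequence[int]) -> float:
    """Phi coefficient between two go/no-go (1/0) outcomes for the same soldiers."""
    pairs = list(zip(test_go, field_go))
    a = sum(1 for t, f in pairs if t and f)          # go on test, go in field
    b = sum(1 for t, f in pairs if t and not f)      # go on test, no go in field
    c = sum(1 for t, f in pairs if not t and f)      # no go on test, go in field
    d = sum(1 for t, f in pairs if not t and not f)  # no go on both
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        # Undefined when everyone falls in a single row or column of the table.
        raise ValueError("phi is undefined: one outcome has no variation")
    return (a * d - b * c) / denom


# Hypothetical data for six trainees.
crt_result = [1, 1, 1, 0, 0, 0]
field_result = [1, 1, 0, 0, 0, 1]
phi = phi_coefficient(crt_result, field_result)
print(round(phi, 3))
```

A value near 1 means trainees who pass the CRT tend also to perform the task in the field; a value near 0 means the test tells little about field performance.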

Army test developers should be instructed in techniques for establishing 
reliability and validity of CRTs. Even if a test evidences content validity 
as a function of careful creation based upon task objectives, reliability is 
still in question. If a test cannot be administered reliably, results are 
meaningless. 
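
As a rough sketch of how reliability might be examined for a go/no-go test, the code below compares the decisions reached on two administrations (or two alternate forms) given to the same soldiers. The statistics used, the proportion of consistent decisions and Cohen's kappa as a chance-corrected version of it, are an assumed choice for illustration and are not drawn from the survey. Names and data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def decision_consistency(first: Sequence[int],
                         second: Sequence[int]) -> Tuple[float, Optional[float]]:
    """Return (proportion of agreeing go/no-go decisions, Cohen's kappa).

    Kappa is None when chance agreement is 1, that is, when every soldier
    received the same decision on both administrations.
    """
    n = len(first)
    agree = sum(1 for x, y in zip(first, second) if x == y) / n
    p_go_first = sum(first) / n
    p_go_second = sum(second) / n
    chance = p_go_first * p_go_second + (1 - p_go_first) * (1 - p_go_second)
    if chance == 1:
        return agree, None
    return agree, (agree - chance) / (1 - chance)


# Hypothetical decisions for eight soldiers tested twice (1 = go, 0 = no go).
form_a = [1, 1, 1, 0, 0, 0, 1, 0]
form_b = [1, 1, 0, 0, 0, 0, 1, 1]
agreement, kappa = decision_consistency(form_a, form_b)
print(agreement, kappa)
```

Raw agreement can look high simply because most soldiers pass; kappa discounts the agreement expected by chance, which is why both figures are reported.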

Administration. A poorly administered test defeats long hours of careful
test development. The CRT survey indicated that a lack of standardized
testing conditions exists in many areas. This is in part attributable to
lack of training in test administration for testers, and in part to lack of
clearly defined test administration instructions.

One administrative problem observed was that soldiers may be aided or 
hindered as a function of their position in the performance testing line. 
Those who are not first in line "get a break" by observing mistakes of others.
The test administrative conditions should specify that trainees waiting
to be tested remain at a certain distance from the test site, or the test
administrators should be instructed in conducting such tests in a standardized
manner, or both.

Careful instructions in test administration are necessary to insure 
accurate testing. Steps should be taken to insure that test administration 
practices are clearly defined for each test, and that test administrators 
are adequately trained. Further, test sites should be regularly inspected 
to insure that tests are being given under the specified standard conditions. 

APPENDIX A



INTERVIEW PROTOCOL:
SURVEY OF CRITERION-REFERENCED TESTING IN THE ARMY



* = Optional question: Ask
as appropriate



Name of Interviewee: 
Mailing Address: 



Telephone Number:

Introduction. Interviewer will:

A. Introduce himself

B. Introduce ASA

C. Explain that ASA is doing contract work for the Army Research Institute

D. State that ASA is interested in improving tests for the Army

E. Explain that ASA wants to find out about current status of testing in
the Army so we can determine what we can build on

1. What is your position in the organization here?

What school or center are you in?

What is your directorate, department,
or unit?

What is your branch or section?
What is your position and title?



2. How long have you been involved in testing? Years ____ Months ____
3. What did you do before you became involved in testing?



Interviewer Statement: Now, I would like to discuss with you some tasks that
may be involved in test construction and use. These tasks are done in
different ways in different places. Sometimes they are combined, in other
cases some are eliminated. They often go by different names. Would you
please tell me which of these you are involved in.

4. Writing objectives. That is--determining what the test will measure and
the conditions under which the measurement will occur in terms of
precise, behavioral statements.

Have you been involved in writing objectives? Yes ____ No ____

If yes, (a) how long have you been doing this? Years ____ Months ____

(b) do you write objectives in operational, behavioral terms?
Yes ____ No ____ Don't understand ____



5. Setting standards. That is--defining the standards against which
performance is evaluated. In many cases, these standards are very similar
to the stated objectives.

Have you participated in setting standards? Yes ____ No ____
If yes, how long have you been doing this? Years ____ Months ____

6. Imposing practical constraints. That is--deciding how the test must be
built so it can actually be used within the limits of the situation for
which it is designed. For example, there are often time constraints
involved in testing complex skills.

Have you been involved in this? Yes ____ No ____

If yes, how long have you been doing this? Years ____ Months ____

7. Determining priorities. That is--deciding how important each standard is
in relation to other standards.

Have you helped determine priorities? Yes ____ No ____

If yes, how long have you been involved in determining priorities?

Years ____ Months ____



8. Writing items. That is--creating items for use in the test.

Have you written, or helped to write items? Yes ____ No ____

If yes, (a) how long have you been involved in writing items?

Years ____ Months ____

(b) does your group of items usually contain more than will be
included in the test? Yes ____ No ____ Don't know ____



9. Selecting final test items. That is--applying statistical tests to
determine the most useful, nonredundant items.

Have you been involved in selecting final test items? Yes ____ No ____
If yes, (a) for how long have you done such work? Years ____ Months ____
(b) do you use an item analysis technique?

Yes ____ No ____ Don't know ____

10. Test administration. That is--administering the test in the situations
for which it was planned. Also, test administration is often done as a
try-out, before the test is finalized.

Have you participated in administering tests? Yes ____ No ____



If yes, (a) for how long have you done so? Years ____ Months ____



(b) have you ever found it appropriate to give help to someone
taking the test if they could not continue without help on
a particular item? Yes ____ No ____ Don't know ____



*11. Measuring reliability. That is--determining if a test will give similar
scores when measuring similar performance. For example, a person taking
equivalent versions of the same test should score about the same on both,
if he has had no practice in between.

Have you been involved in measuring the reliability of tests? Yes ____ No ____
If yes, (a) how long have you been involved in measuring reliability?

Years ____ Months ____

(b) do you compute coefficients of reliability?
Yes ____ No ____ Don't know ____

12. Evaluating validity. The test developer must determine whether the test
is actually measuring what it is supposed to measure. Personnel who score
high on the test should also perform very well on the task that test is
supposed to measure, while those who score low should not be able to
perform the task as well.

Have you helped to validate tests? Yes ____ No ____



If yes, (a) how long have you been doing so? Years ____ Months ____



(b) do you use content validity as opposed to predictive validity?
Yes ____ No ____ Don't know ____



13. Scoring. How are tests generally scored? Are norms set as standards
using bell shaped curves, or are "go-no go" type standards used?

Norms ____ go-no go ____ Other ____



To what uses are the test scores put?

14. One might be using test results to compare student performance. Higher-
scoring students might be considered for promotion, for example, while
those passing with a lower score might not be so considered.

Do you use test results to compare students?

Yes ____ No ____

If yes, (a) how long have you used test scores for comparisons?

Years ____ Months ____

(b) if a student doesn't get a passing score the first time, is
he tested again? Yes ____ No ____ Don't know ____



15. Another use might be using test results to evaluate course adequacy.
Sometimes the results of tests are used to evaluate the success of a
course. Portions of a test that many students fail to perform well on
are seen as reflecting a deficiency in the corresponding portion of a
course. Courses can then be improved, using test results as feedback.

Have you used test results to help improve courses? Yes ____ No ____

If yes, (a) how long have you been doing so? Years ____ Months ____

(b) when you do so, are test criteria based on task objectives,
rather than on course content? Yes ____ No ____ Don't know ____



16. Another use might be using test scores to diagnose areas in which students
needed improvement.

Do you use tests for diagnostic purposes? Yes ____ No ____

If yes, how long have you been doing this? Years ____ Months ____



17. Are there other aspects of test development and use that you are aware of
but I did not mention? Yes ____ No ____

If yes, what are they?

Interviewer Statement: Now I would like to discuss some of the tasks that
you're involved in.



19. What inputs do you have available in terms of documents, data, job aids,
field manuals, etc.? REQUEST THESE



20. Which of these inputs do you actually use? 



[If answer to 20 is other than "all of them", interviewer asks #21]
21. Why do you use these and not the others?



22. What products do you prepare? REQUEST THESE 



23. How are these outputs used?



24. What problems have you encountered?



25. How did you resolve these problems?

*26. Is any special training available for testing personnel? Yes ____ No ____
If yes, please briefly describe this training.



27. What proportion of the tests you have participated in making or using are:
A. Paper-and-pencil knowledge tests?
B. Simulated performance tests, e.g., using
mockups and drawings?
C. "Hands on" performance tests?
D. Other? Specify:



What proportion of the tests you have participated in making or using are
for:

A. Specific skill and knowledge requirements?
B. Specialty areas in a course?
C. End of block within a course?
D. Mid cycle within a course?
E. End of course?



*28. Are you familiar with any team performance situations that were evaluated
by tests? Yes ____ No ____

*29. Would you briefly describe how tests were used to measure team performance?



30. Have time pressures, or other constraints, prevented you from successfully
carrying out some of the tasks involved in test construction and use?

Yes ____ No ____

If yes, describe how you were affected by a constraint.

31. Can you describe any cases in which tests were developed which were not
suitable, in your opinion, for the intended uses? Yes ____ No ____

Description:



If it is the interviewer's opinion that interviewee 
does not understand the distinction between Criterion 
Referenced Testing and norm-referenced testing: 



STOP HERE 



Otherwise go on.



32. One of the main purposes of our work for the Army is to develop a manual
on how to construct Criterion-Referenced as opposed to Norm-Referenced
Tests. Who will be the primary users of a manual of this type on this
post?



33. As you know, in recent years the Army has put increasing emphasis on using
Criterion-Referenced Tests in appropriate testing situations. There is
still much disagreement, though, about what a Criterion-Referenced Test
really is. How is the term "Criterion-Referenced Test" used on this post?



34. How strongly do you feel about future use of Criterion-Referenced Testing
in the Army? Should Criterion-Referenced Test development receive high
or low priority in terms of Army assessment programs?

____ Strongly against--Criterion-Referenced Testing should receive bottom
priority, or be dropped entirely.

____ Against--Criterion-Referenced Testing should receive low priority.

____ Neutral--Criterion-Referenced Testing should receive average priority.

____ For--Criterion-Referenced Testing should receive high priority.

____ Strongly for--Criterion-Referenced Testing should receive top
priority. Criterion-Referenced Tests should replace most or all
norm-referenced tests.

*35. Do you think cost is a major factor in determining whether Criterion-
Referenced Tests are developed and administered in the Army? That is--have
you found that Criterion-Referenced Tests are more or less expensive to
develop and administer than conventional, norm-referenced tests?

Less expensive ____ About the same ____ More expensive ____

*36. Could you describe a situation in which a Criterion-Referenced Test was
found to be prohibitively expensive to develop?



37. Do you think that there are any particular advantages or disadvantages to
developing and using Criterion-Referenced Tests in the Army (as opposed
to norm-referenced measures)? Yes ____ No ____

What are some advantages or disadvantages?



38. Are there any special problems you have encountered while developing or
using Criterion-Referenced Tests, as opposed to problems normally
encountered with norm-referenced tests? Yes ____ No ____

If yes, describe these special problems and how you overcame them:



*39. How serious are these problems? That is, how much do they affect the
overall accomplishment of testing objectives?

40. Do you feel that Criterion-Referenced Testing is practical and useful in
measuring job performance skills? Yes ____ No ____

Why?



41. Are there other areas (such as knowledge tests and achievement tests) where
this concept could be useful? Yes ____ No ____

Why?



42. What should we include to make the manual useful?




APPENDIX B
SUMMARY OF TYPES OF PERSONNEL INTERVIEWED AT ARMY INSTALLATIONS



Table B-1
FORT BENNING INTERVIEWEES

Classification Area           Directorate, Department or Division         Job Title of Interviewee

U.S. Army Infantry School     Directorate of Educational Technology       Deputy Director
                              Faculty Development [illegible]             Chief (S)
                                                                          Senior Instructor (DU)*
                                                                          Instructor (DU)
                                                                          Instructor (DU)
                                                                          Instructor (DU)
                                                                          Student (DU)
                              Brigade & Battalion Operations
                              Department (BBOD)
                              Operations & Training Techniques            Chairman (S)
                              Tactics Group                               Test Officer (S)
                                                                          Project Officer (DU)
                              Combat Support Group                        Instructor (DU)
                                                                          Instructor (DU)
                                                                          Instructor (DU)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)



Table B-1 (continued)

Classification Area           Directorate, Department or Division         Job Title of Interviewee

U.S. Army Infantry School     Directorate of Instruction
(continued)                   Evaluation Division                         Chief (S)
                                                                          Evaluation Staff (DU)
                              Curriculum Division                         Director of Instruction (S)
                              Office of Directorate
                              of Doctrine & Training
                              Task Analysis Division                      Chief (S)
                              Training Management Team                    Chief (S)
                              Office of Medical Staff
                              & Operations
                              Instructional Division                      Chief (DU)
                                                                          Chairman, Resident Committee (DU)
                              Weapons Department
                              Mortar Committee                            Instructor (DU)

TEC Program                                                               Chief (S)

MOS Testing Program                                                       Chief (S)


Table B-2
FORT BLISS INTERVIEWEES

Classification Area           Directorate, Department or Division         Job Title of Interviewee

U.S. Army Air Defense School  High Altitude Missile Department            Training Specialist
                                                                          Chief Project Officer for Curriculum (S)
                                                                          Training Specialist (DU)*
                              Missile Electronic & Control                Technical Publications Editor (S)
                              Systems Department                          Instructor (DU)
                              Command & Staff Department                  Chief, Command & Leadership Division (S)
                                                                          Instructor (DU)
                                                                          Department Staff (DU)
                              Army-wide Training Support Division         Educational Specialist (DU)
                                                                          Educational Specialist (DU)
                                                                          Assistant Chief of Course Development (DU)
                              Low Altitude Air Defense Department         Instructor (DU)
                                                                          Instructor & Technical Writer (DU)
                                                                          Department Staff (DU)
                              Ballistic Missile Defense Department        Training Specialist (DU)
                                                                          Instructor (DU)
                              Deputy Commandant for Training & Education  Executive Officer (S)
                                                                          Staff (S)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)



Table B-2 (continued)

Classification Area           Directorate, Department or Division         Job Title of Interviewee

U.S. Army Air Defense School  Office of the Commandant                    Education Advisor (S)
(continued)

TEC Program                   Training Development Division               Chief of the Division (S)
                                                                          Chief Project Officer for TEC Production (S)
                                                                          Project Officer (DU)
                                                                          Project Officer (DU)

Training Center Program       Air Defense Artillery Training Brigade      Training Coordinator (DU)
                                                                          Instructor (DU)
                                                                          Evaluator (DU)




Table B-3
FORT SILL INTERVIEWEES

Classification Area           Directorate, Department or Division         Job Title of Interviewee

U.S. Army Field Artillery     Tactic Combined Arms Department             Chief, Associate Arms Division
Training School                                                           Senior Instructor (DU)*
                              Gunnery Department                          Chief, Exam Branch (S)
                                                                          Instructor/Grader (DU)
                              Office of the Commandant                    Education Advisor (S)
                              Office of the Deputy Assistant              Educational Specialist (S)
                              Commandant for Training & Education         Educational Specialist (S)
                              Materiel & Maintenance Department           Chief, Cannon Division (S)
                                                                          Instructor (DU)
                              Target Acquisition Department               Supervisory Training Specialist (S)
                                                                          Instructor (DU)
                              Command, Leadership and Training            Senior Instructor (DU)
                              Department                                  Senior Instructor (DU)
                              Communications/Electronics Department       Training Instructor (DU)

MOS Testing Program           Evaluation Brigade                          Chief, MOS Analysis (S)

Training Center Program       Advanced Individual Training Brigade        Officer in Charge (S)
                                                                          Senior Instructor (DU)
                                                                          Instructor in Charge of (DU)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)

Table B-3 (continued)

Classification Area           Directorate, Department or Division         Job Title of Interviewee

TEC Program                   Army-Wide Training Support Department       Chief of Department (S)
                                                                          Chief, TEC Branch (S)
                                                                          Educational Specialist (DU)

Table B-4
FORT KNOX INTERVIEWEES

Classification Area           Directorate, Department or Division         Job Title of Interviewee

U.S. Army Armor School        Directorate of Training                     Chief, Task Analysis Division (S)**
                                                                          Test Director, MOS Evaluations (S)
                              Leadership Department                       Instructor, System and Procedures Branch (DU)*
                              Army Wide Training Support                  Chief, Development Division (S)
                              Directorate of Instruction                  Chief, Instruction Technology Division (S)
                                                                          Instructor, Instruction Technology Division (DU)
                                                                          Educational Specialist, Evaluation Branch (S)
                                                                          Chief, Curriculum Branch (S)
                              C and S Department                          Chief, Cavalry Branch (DU/S)
                                                                          Senior Instructor, Small Unit Tactical Operations (DU)
                              Automotive Department                       Chief, Quality Control Branch (S)
                              Weapons Department                          Training Administrator (DU)

Training Center               Headquarters 1st AIT Brigade                S-3 1st AIT Brigade (S)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)

Table B-5
FORT ORD INTERVIEWEES

Classification Area           Directorate, Department or Division         Job Title of Interviewee

U.S. Army Training Center

Directorate of Plans and      Quality Control Branch                      Chief, Quality Control Branch
Training                                                                  Training Evaluator, Quality Control Branch (S)
                              Basic Combat Training Testing               Project Test Officer, Quality Control Branch [illegible]
                                                                          Instructor, Proficiency Test Branch (DU)

Basic Combat Training         Training Command (Prov)                     Operations and Training Officer (S)
                              Training Brigade                            Battalion Commander (S)
                                                                          Battalion Executive Officer (S)
                                                                          Company Commander (S)
                                                                          Company Commander (S)
                                                                          Officer-in-Charge, First Aid Committee Group [illegible]
                                                                          Instructor, First Aid Committee Group (DU)

**Supervisors of Test Development = (S)
*Test Developers or Users = (DU)



Table B-5 (continued)

Classification Area           Directorate, Department or Division         Job Title of Interviewee

Basic Combat Training                                                     Noncommissioned Officer-in-Charge of Individual
(continued)                                                               Senior Drill Instructor (DU)
                                                                          Drill Instructor (DU)

Advanced Individual Training  Division                                    Chief, Field Wireman Training Division (S)
                                                                          Instructor, Field Wireman Training Division (DU)
                              Food Services Division                      Supervisor, Food Services Division (S)
                                                                          Instructor, Food Services Division (DU)